Learning Objectives
By the end of this section, you will be able to:
Core Knowledge
- Design and analyze A/B tests for ML model comparison
- Calculate required sample sizes for desired power
- Understand the peeking problem and its consequences
- Distinguish frequentist vs Bayesian A/B testing approaches
- Compare A/B testing with multi-armed bandit strategies
Practical Skills
- Implement A/B tests for model deployment decisions
- Choose appropriate metrics for ML experiments
- Handle multiple metrics and guardrail metrics
- Avoid common pitfalls in online experimentation
Real-World Applications
- Model Deployment - Validate that new ML models improve key business metrics before full rollout
- Recommendation Systems - Test different ranking algorithms at companies like Netflix, Spotify, Amazon
- Search Engines - Google runs 10,000+ A/B tests per year to optimize search quality
- Personalization - Validate that personalized experiences outperform generic ones
Central Message: A/B testing is the gold standard for making causal claims about whether a new ML model or feature improves business outcomes. Understanding its statistical foundations is essential for any ML engineer deploying models in production.
The Big Picture: Why A/B Testing Matters for ML
You've built a new recommendation model. Offline metrics (AUC, precision, recall) look great on your holdout set. But here's the uncomfortable truth: offline metrics often don't translate to real-world improvements. Users behave differently with live systems. Your model might have higher precision but introduce latency that frustrates users. It might optimize for clicks but reduce long-term engagement.
The Fundamental Question
"Does my new model actually improve business outcomes when deployed to real users, or could the apparent improvement be due to chance?"
A/B testing (also called online controlled experiments or randomized controlled trials) provides the only rigorous way to answer this question. By randomly assigning users to either the current model (control) or the new model (treatment), we can make causal inferences about the effect of our changes.
From Fisher to Silicon Valley
A Brief History of A/B Testing
1920s-1930s: Ronald Fisher develops randomized controlled experiments for agricultural research. His work on hypothesis testing laid the statistical foundations we still use today.
1950s-1990s: Clinical trials adopt these methods rigorously. Pharmaceutical companies use randomized controlled trials to test drug efficacy.
2000: A Google engineer's first A/B test on ad colors launches the modern era of web experimentation. Today, tech companies run thousands of simultaneous experiments.
"At Google, experimentation is practically a religion." (Diane Tang, Google)
| Company | Experiments/Year | Notable Use Case |
|---|---|---|
| Google | ~10,000+ | Search ranking, ad optimization |
| Microsoft (Bing) | ~15,000 | Search features, UI changes |
| Netflix | ~300 | Recommendation algorithms, UI |
| Amazon | ~1,000+ | Product recommendations, pricing |
| Meta | ~10,000+ | News feed ranking, ad targeting |
The A/B Testing Framework
An A/B test is a randomized controlled experiment with the following structure:
- Define the hypothesis: What improvement do you expect from the new model?
- Choose metrics: Select a primary metric and guardrail metrics
- Calculate sample size: Determine how many users you need
- Randomize: Randomly assign users to control (A) or treatment (B)
- Collect data: Run the experiment for the predetermined duration
- Analyze: Perform hypothesis test and interpret results
- Decide: Ship, iterate, or abandon based on statistical evidence
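The randomization step is often implemented by hashing a stable user identifier, so that a returning user always lands in the same group. The sketch below illustrates that idea under stated assumptions: the function name `assign_variant`, the experiment label `ranker-v2`, and the 50/50 split are hypothetical choices for illustration, not something this section prescribes.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing (experiment, user_id) keeps a user's assignment stable across
    sessions while giving each experiment its own independent split.
    """
    if not 0.0 < treatment_share < 1.0:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as an integer and scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical usage: assign 10,000 synthetic user IDs and inspect the split.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "ranker-v2")] += 1
print(counts)
```

Including the experiment name in the hash is a design choice: it avoids the same users always ending up in treatment across unrelated experiments.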
Hypothesis Formulation for ML Metrics
For ML model comparisons, we typically test whether the treatment model improves a key business metric. Let $\mu_A$ be the metric value for control and $\mu_B$ for treatment.
Standard A/B Test Hypotheses
Null Hypothesis (H₀)
$H_0: \mu_A = \mu_B$. The models have the same effect on the metric.
Alternative Hypothesis (H₁)
$H_1: \mu_A \neq \mu_B$. The models have different effects (two-tailed).
Statistical Tests for A/B Testing
For comparing proportions (conversion rates, click-through rates), we use the two-proportion z-test:
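Two-Proportion Z-Test
Pooled proportion: $\hat{p} = \dfrac{x_A + x_B}{n_A + n_B}$
Standard error: $SE = \sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}$
Z-statistic: $z = \dfrac{\hat{p}_B - \hat{p}_A}{SE}$
where $x_A, x_B$ are successes, $n_A, n_B$ are sample sizes, and $\hat{p}_A = x_A/n_A$, $\hat{p}_B = x_B/n_B$ are the observed rates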
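To make the two-proportion z-test concrete, here is a minimal standard-library sketch. The function name `two_proportion_z_test` is an illustrative choice, and the counts reuse the 12% vs 13.5% example that appears later in this section.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x_a: int, n_a: int, x_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: p_A = p_B."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                        # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))    # standard error under H0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))              # two-tailed
    return z, p_value

z, p = two_proportion_z_test(x_a=1200, n_a=10_000, x_b=1350, n_b=10_000)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

The pooled standard error is used because, under the null hypothesis, both groups share a single conversion rate.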
Interactive: A/B Test Simulator
Run a simulated A/B test comparing two ML models. Adjust the true conversion rates and watch how sample size affects the ability to detect a real difference.
Simulate an A/B test comparing two ML models. Model A is the current production model (control), Model B is a new model (treatment). The conversion rate represents the proportion of users who take a desired action (click, purchase, etc.).
How This Works in Real ML Systems
H₀: p_A = p_B (Models have the same conversion rate)
H₁: p_A ≠ p_B (Models have different conversion rates)
Companies like Google, Netflix, Meta, and Amazon use this exact framework to:
- Validate new ML model improvements before full deployment
- Test UI changes that affect user engagement
- Compare recommendation algorithms
- Measure the impact of ranking changes
Warning: Peeking Problem
In real A/B tests, do NOT stop the test early just because you see a significant p-value! Repeated checking inflates the false positive rate (this is called the "peeking problem"). Either use sequential testing methods or pre-determine your sample size and only check significance once at the end.
Power Analysis and Sample Size
One of the most critical decisions in A/B testing is determining how many users you need. Too few, and you won't detect real improvements. Too many, and you're wasting traffic on experiments when you could be shipping.
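Sample Size Formula (Per Group)
$$n = \frac{\left(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_A(1-p_A) + p_B(1-p_B)}\right)^2}{(p_B - p_A)^2}, \qquad \bar{p} = \frac{p_A + p_B}{2}$$
Simplified approximation: $n \approx 16\sigma^2/\delta^2$ for 80% power at α = 0.05, where $\sigma^2 = p_A(1-p_A)$ and $\delta = p_B - p_A$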
| Parameter | Symbol | Typical Value | Effect on Sample Size |
|---|---|---|---|
| Significance level | α | 0.05 | Lower α → larger n |
| Power | 1-β | 0.80 | Higher power → larger n |
| Effect size | δ = p_B - p_A | 1-5% | Smaller effect → much larger n |
| Baseline rate | p_A | Varies | Affects variance |
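The constant 16 in the simplified approximation comes from $2(z_{1-\alpha/2} + z_{1-\beta})^2$ evaluated at α = 0.05 and 80% power. The sketch below checks that arithmetic with the standard library; `rule_of_thumb_n` is a hypothetical helper name, and the 10% baseline is only an example value.

```python
from math import ceil
from statistics import NormalDist

def rule_of_thumb_n(baseline: float, delta: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    sigma_sq = baseline * (1 - baseline)   # Bernoulli variance at the baseline rate
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma_sq / delta ** 2)

# The multiplier that the rule of thumb rounds up to 16.
constant = 2 * (NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.80)) ** 2
print(f"2 * (z_0.975 + z_0.80)^2 = {constant:.2f}")

n_2pt = rule_of_thumb_n(baseline=0.10, delta=0.02)
n_1pt = rule_of_thumb_n(baseline=0.10, delta=0.01)
print(f"delta = 2 points: n per group ~ {n_2pt:,}")
print(f"delta = 1 point:  n per group ~ {n_1pt:,} ({n_1pt / n_2pt:.1f}x as many)")
```

Because n depends on 1/δ², halving the detectable effect roughly quadruples the required sample.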
Interactive: Sample Size Calculator
Use this calculator to plan your A/B test. Enter your baseline conversion rate, desired minimum detectable effect, and see the required sample size.
Plan your A/B test by calculating the required sample size, statistical power, or minimum detectable effect. All calculations assume a two-tailed test for proportion differences.
Key Insights for ML Engineers
- Smaller effects need more data: Detecting a 1% improvement needs ~4x as many samples as a 2% improvement, because the required n scales with 1/δ²
- Industry standard: 80% power and α = 0.05 are conventional choices
- Traffic constraints: If you can't reach the required sample size, either accept lower power or focus on larger improvements
- Multiple metrics: Running tests on multiple metrics requires correction (Bonferroni, etc.), which increases sample size requirements
Testing ML-Specific Metrics
ML systems often require testing multiple metrics simultaneously. A single "primary metric" doesn't capture all the effects of a model change.
Primary Metrics
The main metric you're trying to improve. The experiment decision is based on this.
- Click-through rate (CTR)
- Conversion rate
- Revenue per user
- Engagement (time spent, actions)
Guardrail Metrics
Metrics that must not regress, even if the primary metric improves.
- Latency / page load time
- Error rates
- User retention
- Long-term engagement
ML-Specific Metrics
- Model-specific metrics (AUC, precision, recall) measured on live traffic
- Prediction confidence/calibration
- Feature coverage (does the new model handle more cases?)
- Model latency (inference time)
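When several metrics are tested in the same experiment, one simple and conservative option is the Bonferroni correction mentioned earlier: compare each p-value against α/m, where m is the number of metrics. The sketch below is illustrative only; the metric names and p-values are invented for the example, and `bonferroni_decisions` is a hypothetical helper.

```python
def bonferroni_decisions(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag each metric as significant at the Bonferroni-adjusted level alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return {name: p < threshold for name, p in p_values.items()}

# Hypothetical p-values, invented purely for illustration.
p_values = {
    "ctr": 0.004,
    "revenue_per_user": 0.030,
    "latency_p95": 0.200,
    "retention_7d": 0.012,
}
decisions = bonferroni_decisions(p_values)
for name, significant in decisions.items():
    label = "significant" if significant else "not significant"
    print(f"{name:>18}: p = {p_values[name]:.3f} -> {label}")
```

With four metrics the per-metric threshold drops to 0.0125, so a result such as p = 0.03 that would pass an uncorrected test no longer does.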
Interactive: Multi-Metric ML Testing
Test a new ML model across multiple metrics simultaneously. See how improvements in one metric might come with tradeoffs in others.
Compare ML models across multiple metrics. Set the "true" performance values (unknown in real life), then run the experiment to see how observed metrics converge and when statistical significance is reached.
The Peeking Problem
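One of the most common mistakes in A/B testing is peeking: checking results multiple times during the experiment and stopping early when you see significance.
Why Peeking is Dangerous
When you check significance k times during an experiment and treat the looks as independent, the chance of at least one false positive is approximately:
$$P(\text{FP}) \approx 1 - (1 - \alpha)^k$$
Checking daily for a month (k = 30) with α = 0.05 gives a ~78% chance of a false positive under this approximation. Looks at accumulating data are positively correlated, so the true rate is lower than this bound, but it still ends up well above the nominal 5%.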
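The following sketch evaluates the independence approximation and then runs a small A/A simulation (H₀ true) to compare stopping at the first significant peek with a single final analysis. It is a rough illustration, not a calibrated study: the experiment count, group size, number of peeks, and seed are arbitrary assumptions, and the helper names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def naive_inflation(alpha: float, k: int) -> float:
    """Independence approximation: P(at least one false positive in k looks)."""
    return 1 - (1 - alpha) ** k

def is_significant(x_a: int, x_b: int, n: int, z_crit: float) -> bool:
    """Two-proportion z-test with equal group sizes n."""
    p_pool = (x_a + x_b) / (2 * n)
    if p_pool in (0.0, 1.0):
        return False
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    return abs((x_b - x_a) / n) / se > z_crit

def simulate_peeking(n_experiments=300, n_per_group=1000, n_peeks=10,
                     rate=0.10, alpha=0.05, seed=0):
    """Run A/A experiments and return (peeking FPR, single-look FPR)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_group // n_peeks
    peeking_hits = final_hits = 0
    for _ in range(n_experiments):
        x_a = x_b = 0
        flagged = False
        for look in range(1, n_peeks + 1):
            x_a += sum(rng.random() < rate for _ in range(step))
            x_b += sum(rng.random() < rate for _ in range(step))
            significant = is_significant(x_a, x_b, look * step, z_crit)
            flagged = flagged or significant   # would have stopped at any look
        peeking_hits += flagged
        final_hits += significant              # result of the last look only
    return peeking_hits / n_experiments, final_hits / n_experiments

print(f"Independence approximation, k = 30: {naive_inflation(0.05, 30):.1%}")
peek_fpr, fixed_fpr = simulate_peeking()
print(f"Simulated false positive rate with 10 peeks: {peek_fpr:.1%}")
print(f"Simulated false positive rate, single look:  {fixed_fpr:.1%}")
```

Because the final look is one of the peeks, the peeking rate can never be below the single-look rate in this setup; the simulation shows how much higher it gets.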
Interactive: Peeking Simulation
See the peeking problem in action. This simulation runs experiments where H₀ is true (both groups have the same rate), showing how repeated checking inflates false positives.
When H₀ is true (both variants have the same conversion rate), repeated checking at α = 0.05 dramatically increases the chance of seeing a "significant" result. This simulation demonstrates the real-world consequences of peeking.
Solutions to the Peeking Problem
1. Fixed Sample Size
Pre-register sample size and only analyze once at the end.
2. Sequential Testing
Use O'Brien-Fleming or Pocock boundaries that account for multiple looks.
3. Bayesian Methods
Bayesian A/B testing doesn't have the same peeking penalty.
4. α-Spending Functions
Distribute α budget across interim analyses (e.g., Lan-DeMets).
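As a rough illustration of how an α budget can be distributed, the sketch below evaluates two commonly used Lan-DeMets spending functions at four equally spaced looks: an O'Brien-Fleming-type function, $2\,(1 - \Phi(z_{1-\alpha/2}/\sqrt{t}))$, and a Pocock-type function, $\alpha \ln(1 + (e-1)t)$, where $t$ is the information fraction. It reports only the cumulative α spent; turning that into exact rejection boundaries requires numerical integration, which is not attempted here. The function names are hypothetical.

```python
from math import e, log, sqrt
from statistics import NormalDist

def obf_spending(t: float, alpha: float = 0.05) -> float:
    """O'Brien-Fleming-type spending: cumulative alpha spent by information fraction t."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))

def pocock_spending(t: float, alpha: float = 0.05) -> float:
    """Pocock-type spending: cumulative alpha spent by information fraction t."""
    return alpha * log(1 + (e - 1) * t)

# Four equally spaced looks (information fractions 0.25, 0.5, 0.75, 1.0).
for look in range(1, 5):
    t = look / 4
    print(f"t = {t:.2f}: OBF-type spent = {obf_spending(t):.5f}, "
          f"Pocock-type spent = {pocock_spending(t):.5f}")
```

The O'Brien-Fleming-type function spends almost nothing early and saves most of the budget for the final analysis, while the Pocock-type function spends more evenly; both reach the full α at t = 1.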
Bayesian A/B Testing
Bayesian A/B testing offers an alternative paradigm that many find more intuitive and practical for real-world ML applications.
Frequentist Approach
- Answer: "Is the difference significant?" (Yes/No)
- Output: p-value, confidence interval
- Requires fixed sample size
- Peeking inflates false positive rate
- Cannot say "B is 90% likely to be better"
Bayesian Approach
- Answer: "What's P(B > A)?" (probability)
- Output: posterior distributions, credible intervals
- Can stop anytime based on decision criteria
- Natural handling of continuous monitoring
- Directly answers business questions
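Bayesian A/B Test with Beta Priors
Prior (uninformative): $p_A, p_B \sim \text{Beta}(1, 1)$
Posterior after data: $p_i \mid \text{data} \sim \text{Beta}(1 + x_i,\ 1 + n_i - x_i)$ for $i \in \{A, B\}$, with $x_i$ successes out of $n_i$ users
Probability B is better (via Monte Carlo): $P(p_B > p_A \mid \text{data}) \approx \frac{1}{N}\sum_{j=1}^{N} \mathbb{1}\left[p_B^{(j)} > p_A^{(j)}\right]$, using $N$ paired draws from the two posteriors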
Interactive: Bayesian A/B Testing
Watch Bayesian inference in action. As you collect more data, the posterior distributions become more concentrated, and you can see the probability that B outperforms A.
Why Bayesian A/B Testing?
- Get a probability that B is better, not just "significant" or not
- Can stop early when confident enough (no peeking penalty)
- Works with small sample sizes through prior information
- Naturally handles uncertainty: wider curves = more uncertain
When to Use Bayesian A/B Testing
- When you want to continuously monitor results without peeking penalties
- When stakeholders want probabilities rather than p-values
- When you have prior information about likely effect sizes
- For faster decisions when resources are limited
A/B Testing vs Multi-Armed Bandits
A/B testing and multi-armed bandits represent different philosophies for online experimentation:
| Aspect | A/B Testing | Multi-Armed Bandits |
|---|---|---|
| Allocation | Fixed (50/50) | Adaptive (explore/exploit) |
| Goal | Statistical inference | Minimize regret |
| Traffic efficiency | Lower | Higher |
| Statistical rigor | High | Lower |
| Stopping rule | Fixed sample size | Continuous |
| Best for | Hypothesis testing | Optimization |
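Regret is the key concept in bandit algorithms. It measures the cumulative reward lost by not always showing the best arm:
$$R_T = T\mu^* - \sum_{t=1}^{T} \mathbb{E}[r_t]$$
where $\mu^*$ is the mean reward of the best arm, $T$ is the number of rounds, and $r_t$ is the reward received in round $t$.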
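To see regret accumulate, the sketch below compares a fixed 50/50 split with Thompson Sampling on two Bernoulli arms. It tracks pseudo-regret, adding $\mu^*$ minus the true mean of whichever arm was played in each round. The true rates, horizon, and seed are hypothetical values chosen for illustration, and the function names are not from any library.

```python
import random

def ab_allocation(t, successes, failures, rng) -> int:
    """Fixed 50/50 split: alternate arms regardless of observed data."""
    return t % 2

def thompson_allocation(t, successes, failures, rng) -> int:
    """Thompson Sampling: draw from each arm's Beta posterior, play the largest draw."""
    draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
    return draws.index(max(draws))

def run_policy(policy, true_rates, horizon=5000, seed=0) -> float:
    """Return pseudo-regret: sum over rounds of (mu* - mean of the arm played)."""
    rng = random.Random(seed)
    best = max(true_rates)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    regret = 0.0
    for t in range(horizon):
        arm = policy(t, successes, failures, rng)
        reward = rng.random() < true_rates[arm]   # simulated Bernoulli outcome
        successes[arm] += reward
        failures[arm] += not reward
        regret += best - true_rates[arm]
    return regret

true_rates = [0.10, 0.15]   # hypothetical conversion rates for arms A and B
ab_regret = run_policy(ab_allocation, true_rates)
ts_regret = run_policy(thompson_allocation, true_rates)
print(f"50/50 split regret:       {ab_regret:.1f}")
print(f"Thompson Sampling regret: {ts_regret:.1f}")
```

The 50/50 split sends half of all traffic to the weaker arm by construction, so its regret grows linearly with the horizon; an adaptive policy can shift traffic as evidence accumulates, at the cost of less balanced data for inference.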
Interactive: Bandits Comparison
Compare A/B testing with bandit algorithms (Thompson Sampling, UCB1, ε-Greedy). See how bandits allocate more traffic to the winning arm over time, reducing regret.
When to Use Each Approach
Use A/B Testing When:
- You need rigorous statistical inference
- Regulatory requirements apply (e.g., medical)
- You want interpretable p-values and CIs
- Long-term effects matter (not just immediate reward)
Use Bandits When:
- Minimizing regret (lost reward) is critical
- You need faster convergence to the winner
- Traffic is limited and every conversion counts
- Continuous optimization (not one-time test)
When Not to Use Bandits:
- When you need rigorous p-values or confidence intervals
- When regulatory requirements apply (medical, financial)
- When delayed conversions make immediate feedback impossible
- When you need to measure effect size precisely
Practical Considerations
Python Implementation
Let's implement a complete A/B testing analysis pipeline in Python, covering sample size calculation, Bayesian analysis, sequential boundaries, and a combined frequentist and Bayesian report:
```python
import numpy as np
from scipy import stats

# ================================================
# 1. Sample Size Calculation
# ================================================

def sample_size_for_proportion(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    Calculate sample size per group for A/B test on proportions.

    Parameters
    ----------
    baseline_rate : float
        Expected conversion rate of control group
    mde : float
        Minimum detectable effect (absolute difference)
    alpha : float
        Significance level (Type I error rate)
    power : float
        Desired power (1 - Type II error rate)

    Returns
    -------
    int : Required sample size per group
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde

    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)

    p_bar = (p1 + p2) / 2

    n = (z_alpha * np.sqrt(2*p_bar*(1-p_bar)) +
         z_beta * np.sqrt(p1*(1-p1) + p2*(1-p2)))**2 / (p2 - p1)**2

    return int(np.ceil(n))

# Example
n = sample_size_for_proportion(baseline_rate=0.10, mde=0.02)
print(f"Required sample size per group: {n:,}")


# ================================================
# 2. Bayesian A/B Testing
# ================================================

def bayesian_ab_test(successes_a, n_a, successes_b, n_b,
                     prior_alpha=1, prior_beta=1, n_samples=100000):
    """
    Bayesian A/B test using Beta-Binomial model.

    Returns probability that B is better than A.
    """
    # Posterior distributions (conjugate update)
    alpha_a = prior_alpha + successes_a
    beta_a = prior_beta + (n_a - successes_a)

    alpha_b = prior_alpha + successes_b
    beta_b = prior_beta + (n_b - successes_b)

    # Monte Carlo sampling
    samples_a = np.random.beta(alpha_a, beta_a, n_samples)
    samples_b = np.random.beta(alpha_b, beta_b, n_samples)

    # P(B > A)
    prob_b_better = np.mean(samples_b > samples_a)

    # Expected relative lift, in percent
    expected_lift = np.mean((samples_b - samples_a) / samples_a) * 100

    # Credible intervals
    ci_a = np.percentile(samples_a, [2.5, 97.5])
    ci_b = np.percentile(samples_b, [2.5, 97.5])
    ci_diff = np.percentile(samples_b - samples_a, [2.5, 97.5])

    return {
        'prob_b_better': prob_b_better,
        'expected_lift_percent': expected_lift,
        'posterior_mean_a': samples_a.mean(),
        'posterior_mean_b': samples_b.mean(),
        'ci_95_a': ci_a,
        'ci_95_b': ci_b,
        'ci_95_diff': ci_diff
    }

# Example
result = bayesian_ab_test(
    successes_a=120, n_a=1000,   # 12% conversion
    successes_b=145, n_b=1000    # 14.5% conversion
)
print(f"P(B > A) = {result['prob_b_better']:.1%}")
print(f"Expected lift: {result['expected_lift_percent']:.1f}%")


# ================================================
# 3. Sequential Testing (O'Brien-Fleming)
# ================================================

def obrien_fleming_boundary(alpha, n_looks, look_number):
    """
    Calculate an approximate O'Brien-Fleming boundary for sequential testing.

    Returns z-score threshold for rejecting H0 at this look.
    """
    if look_number < 1 or look_number > n_looks:
        raise ValueError("look_number must be between 1 and n_looks")

    # Information fraction at this look
    t = look_number / n_looks

    # Approximate z-boundary: the fixed-sample critical value scaled by 1/sqrt(t)
    z_alpha_full = stats.norm.ppf(1 - alpha/2)
    z_boundary = z_alpha_full / np.sqrt(t)

    return z_boundary

# Example: 3 interim looks + 1 final
for look in [1, 2, 3, 4]:
    boundary = obrien_fleming_boundary(alpha=0.05, n_looks=4, look_number=look)
    print(f"Look {look}: |z| > {boundary:.3f} to reject H0")


# ================================================
# 4. Complete A/B Test Analysis
# ================================================

def analyze_ab_test(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Complete A/B test analysis with both frequentist and Bayesian results."""

    # Frequentist analysis: two-proportion z-test with pooled standard error
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    # Confidence interval for the difference uses the unpooled standard error
    se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci = (p_b - p_a - z_crit*se_diff, p_b - p_a + z_crit*se_diff)

    # Bayesian analysis
    bayes = bayesian_ab_test(successes_a, n_a, successes_b, n_b)

    # Print report
    print("=" * 60)
    print("A/B TEST ANALYSIS REPORT")
    print("=" * 60)
    print(f"\nSample Sizes: Control = {n_a:,}, Treatment = {n_b:,}")
    print(f"Conversions: Control = {successes_a:,}, Treatment = {successes_b:,}")
    print(f"\nConversion Rates:")
    print(f"  Control: {p_a*100:.2f}%")
    print(f"  Treatment: {p_b*100:.2f}%")
    print(f"  Lift: {(p_b-p_a)/p_a*100:+.2f}%")
    print(f"\nFrequentist Results:")
    print(f"  z-statistic: {z:.3f}")
    print(f"  p-value: {p_value:.4f}")
    print(f"  95% CI: [{ci[0]*100:.2f}%, {ci[1]*100:.2f}%]")
    print(f"  Significant: {'Yes' if p_value < alpha else 'No'} (α = {alpha})")
    print(f"\nBayesian Results:")
    print(f"  P(B > A): {bayes['prob_b_better']*100:.1f}%")
    print(f"  95% Credible Interval: [{bayes['ci_95_diff'][0]*100:.2f}%, {bayes['ci_95_diff'][1]*100:.2f}%]")

    if bayes['prob_b_better'] > 0.95:
        print("\n✅ RECOMMENDATION: Ship treatment (high confidence B is better)")
    elif bayes['prob_b_better'] < 0.05:
        print("\n❌ RECOMMENDATION: Keep control (high confidence A is better)")
    else:
        print("\n⏳ RECOMMENDATION: Continue collecting data")

# Example usage
analyze_ab_test(
    successes_a=1200, n_a=10000,   # 12% baseline
    successes_b=1350, n_b=10000    # 13.5% treatment
)
```
Knowledge Check
Test your understanding of A/B testing for ML applications with this quiz.
A company runs an A/B test comparing two recommendation models. After 1 week, the p-value is 0.08. After 2 weeks, it drops to 0.03. What is the correct interpretation?
Summary
Key Takeaways
- A/B testing is essential for ML deployment: Offline metrics don't always predict online performance. Randomized experiments provide causal evidence.
- Power analysis is critical: Calculate sample size before starting. Required sample size scales inversely with the square of the effect size.
- Beware the peeking problem: Repeated checking at α = 0.05 inflates false positive rates dramatically. Use sequential testing or Bayesian methods.
- Multiple metrics need multiple considerations: Define primary and guardrail metrics. Consider corrections for multiple testing.
- Bayesian A/B testing offers practical advantages: Continuous monitoring, probability statements, and faster decisions, but it requires a different interpretation.
- Bandits optimize differently than A/B tests: Multi-armed bandits minimize regret but sacrifice statistical rigor. Choose based on your goals.
Quick Reference
| Topic | Key Formula / Rule |
|---|---|
| Z-test statistic | z = (p_B - p_A) / √[p̂(1-p̂)(1/n_A + 1/n_B)] |
| Sample size (approx) | n ≈ 16σ²/δ² for 80% power, α=0.05 |
| False positive with k peeks | P(FP) ≈ 1 - (1-α)^k |
| Bayesian posterior | p \| data ~ Beta(1 + successes, 1 + failures) |
| Regret | T·μ* - Σ E[r_t] |
Industry Best Practices
Before the Test
- Define hypothesis and success criteria
- Calculate required sample size
- Pre-register analysis plan
- Set up proper logging and metrics
During and After
- Check for sample ratio mismatch
- Don't peek without correction
- Run long enough for novelty effects
- Document and share learnings
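A sample ratio mismatch check can be sketched as a chi-square goodness-of-fit test on the observed group sizes. The example below assumes an intended 50/50 split and uses only the standard library; `srm_p_value` is a hypothetical helper, and the counts are invented for illustration.

```python
from math import erfc, sqrt

def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit test (1 degree of freedom) for the observed split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))

# Invented counts: a small imbalance and a large one.
p_small = srm_p_value(10_000, 10_050)
p_large = srm_p_value(10_000, 10_600)
print(f"10,000 vs 10,050 users: p = {p_small:.3f}")
print(f"10,000 vs 10,600 users: p = {p_large:.2e}")
```

A very small p-value suggests the assignment or logging pipeline is not delivering the intended split, in which case the experiment's results should be treated with suspicion until the cause is found. The threshold to act on is a judgment call; a strict cutoff is typical because the check is run on every experiment.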
Looking Ahead: In the next section, we'll explore Sequential Testing, which provides a principled framework for monitoring experiments over time while maintaining proper error control.