Boo-AI — Master Artificial Intelligence by Building from Scratch

Learning Objectives

By the end of this section, you will be able to:

📚 Core Knowledge

• Design and analyze A/B tests for ML model comparison
• Calculate required sample sizes for desired power
• Understand the peeking problem and its consequences
• Distinguish frequentist vs Bayesian A/B testing approaches
• Compare A/B testing with multi-armed bandit strategies

🔧 Practical Skills

• Implement A/B tests for model deployment decisions
• Choose appropriate metrics for ML experiments
• Handle multiple metrics and guardrail metrics
• Avoid common pitfalls in online experimentation

🧠 Real-World Applications

• Model Deployment - Validate that new ML models improve key business metrics before full rollout
• Recommendation Systems - Test different ranking algorithms at companies like Netflix, Spotify, Amazon
• Search Engines - Google runs 10,000+ A/B tests per year to optimize search quality
• Personalization - Validate that personalized experiences outperform generic ones

Central Message: A/B testing is the gold standard for making causal claims about whether a new ML model or feature improves business outcomes. Understanding its statistical foundations is essential for any ML engineer deploying models in production.

The Big Picture: Why A/B Testing Matters for ML

You've built a new recommendation model. Offline metrics (AUC, precision, recall) look great on your holdout set. But here's the uncomfortable truth: offline metrics often don't translate to real-world improvements. Users behave differently with live systems. Your model might have higher precision but introduce latency that frustrates users. It might optimize for clicks but reduce long-term engagement.

The Fundamental Question

"Does my new model actually improve business outcomes when deployed to real users, or could the apparent improvement be due to chance?"

A/B testing (also called an online controlled experiment; the same design is known as a randomized controlled trial in medicine) is the most direct and rigorous way to answer this question. Because users are randomly assigned to either the current model (control) or the new model (treatment), the two groups differ systematically only in the model they see, so a difference in outcomes can be attributed to the model change.

From Fisher to Silicon Valley

A Brief History of A/B Testing

1920s-1930s: Ronald Fisher develops randomized controlled experiments for agricultural research. His work on hypothesis testing laid the statistical foundations we still use today.

1950s-1990s: Clinical trials adopt these methods rigorously. Pharmaceutical companies use randomized controlled trials to test drug efficacy.

2000: Google engineers run the company's first A/B test, on how many search results to show per page, an early milestone in the modern era of web experimentation. Today, large tech companies run thousands of experiments concurrently.

"At Google, experimentation is practically a mantra." (Diane Tang et al., Google)

Company	Experiments/Year	Notable Use Case
Google	~10,000+	Search ranking, ad optimization
Microsoft (Bing)	~15,000	Search features, UI changes
Netflix	~300	Recommendation algorithms, UI
Amazon	~1,000+	Product recommendations, pricing
Meta	~10,000+	News feed ranking, ad targeting

The A/B Testing Framework

An A/B test is a randomized controlled experiment with the following structure:

1. Define the hypothesis: What improvement do you expect from the new model?
2. Choose metrics: Select a primary metric and guardrail metrics
3. Calculate sample size: Determine how many users you need
4. Randomize: Randomly assign users to control (A) or treatment (B)
5. Collect data: Run the experiment for the predetermined duration
6. Analyze: Perform the hypothesis test and interpret the results
7. Decide: Ship, iterate, or abandon based on statistical evidence
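The randomization step is often implemented with a deterministic hash so that a returning user always sees the same variant. The sketch below is one common way to do this, not a prescribed method; the function name `assign_variant`, the experiment name `ranker-v2`, and the user ID format are all hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting the hash with the experiment name keeps assignments
    # independent across different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as an integer and scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Usage: assign 10,000 hypothetical users and check the split is close to 50/50.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "ranker-v2")] += 1
print(counts)
```

Hashing on a stable user ID (rather than drawing a random number per request) means the assignment needs no storage and stays consistent across sessions.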

Hypothesis Formulation for ML Metrics

For ML model comparisons, we typically test whether the treatment model improves a key business metric. Let $p_A$ be the metric value for control and $p_B$ for treatment.

Standard A/B Test Hypotheses

Null Hypothesis (H₀)

p_A = p_B

The models have the same effect on the metric

Alternative Hypothesis (H₁)

p_A \neq p_B

The models have different effects (two-tailed)

Statistical Tests for A/B Testing

For comparing proportions (conversion rates, click-through rates), we use the two-proportion z-test:

Two-Proportion Z-Test

Pooled proportion:

\hat{p} = \frac{X_A + X_B}{n_A + n_B}

Standard error:

SE = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}

Z-statistic:

z = \frac{\hat{p}_B - \hat{p}_A}{SE}

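where $X_A, X_B$ are the numbers of successes (e.g. conversions), $n_A, n_B$ are the sample sizes, and $\hat{p}_A = X_A / n_A$, $\hat{p}_B = X_B / n_B$ are the observed rates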
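As a worked example with hypothetical counts (1,200 conversions out of 10,000 control users and 1,350 out of 10,000 treatment users, chosen purely for illustration), the three formulas give:

```latex
\begin{aligned}
\hat{p}_A &= \frac{1200}{10000} = 0.120, \qquad \hat{p}_B = \frac{1350}{10000} = 0.135 \\
\hat{p} &= \frac{1200 + 1350}{10000 + 10000} = 0.1275 \\
SE &= \sqrt{0.1275 \times 0.8725 \times \left(\frac{1}{10000} + \frac{1}{10000}\right)} \approx 0.00472 \\
z &= \frac{0.135 - 0.120}{0.00472} \approx 3.18
\end{aligned}
```

Since |z| ≈ 3.18 exceeds the two-tailed critical value of 1.96 at α = 0.05, this hypothetical result would reject H₀; the corresponding p-value is roughly 0.0015.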

Interactive: A/B Test Simulator

Run a simulated A/B test comparing two ML models. Adjust the true conversion rates and watch how sample size affects the ability to detect a real difference.

🔬A/B Testing Simulator: ML Model Comparison

Simulate an A/B test comparing two ML models. Model A is the current production model (control), Model B is a new model (treatment). The conversion rate represents the proportion of users who take a desired action (click, purchase, etc.).

⚙️ Simulation Parameters (True Rates - Unknown in Real Life)

Model A True Rate: 10.0%

Model B True Rate: 12.0%

Significance Level (α): 0.05

True difference: 2.0% absolute, 20.0% relative lift


🧠 How This Works in Real ML Systems

H₀: p_A = p_B (Models have the same conversion rate)

H₁: p_A ≠ p_B (Models have different conversion rates)

Companies like Google, Netflix, Meta, and Amazon use this exact framework to:

Validate new ML model improvements before full deployment
Test UI changes that affect user engagement
Compare recommendation algorithms
Measure the impact of ranking changes

⚠️ Warning: Peeking Problem

In real A/B tests, do NOT stop the test early just because you see a significant p-value! Repeated checking inflates the false positive rate (this is called the "peeking problem"). Either use sequential testing methods or pre-determine your sample size and only check significance once at the end.

Power Analysis and Sample Size

One of the most critical decisions in A/B testing is determining how many users you need. Too few, and you won't detect real improvements. Too many, and you're wasting traffic on experiments when you could be shipping.

Sample Size Formula (Per Group)

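n = \frac{\left(z_{\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_\beta\sqrt{p_A(1-p_A) + p_B(1-p_B)}\right)^2}{(p_B - p_A)^2}

where $\bar{p} = (p_A + p_B)/2$, $z_{\alpha/2}$ is the critical value for a two-tailed test at level α, and $z_\beta$ is the standard normal quantile for the desired power 1 - β.

Simplified approximation: $n \approx \frac{16\sigma^2}{\delta^2}$ per group for 80% power at α = 0.05, where $\sigma^2 = \bar{p}(1-\bar{p})$ and $\delta = p_B - p_A$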

Parameter	Symbol	Typical Value	Effect on Sample Size
Significance level	α	0.05	Lower α → larger n
Power	1-β	0.80	Higher power → larger n
Effect size	δ = p_B - p_A	1-5%	Smaller effect → much larger n
Baseline rate	p_A	Varies	Affects variance

The Effect Size Trap: Required sample size scales with the inverse square of the effect size, so detecting a 1-percentage-point improvement takes roughly 4× as many users as detecting a 2-point improvement. This is why power analysis before the experiment is critical.
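The inverse-square relationship can be checked numerically. The following sketch evaluates the standard per-group sample size formula for two proportions at a 10% baseline; the helper name `n_per_group` and the chosen rates are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p_a: float, p_b: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-tailed two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return ceil(numerator / (p_b - p_a) ** 2)

# Halving the detectable effect from 2 points to 1 point at a 10% baseline.
n_2pt = n_per_group(0.10, 0.12)
n_1pt = n_per_group(0.10, 0.11)
print(f"2-point effect: {n_2pt:,} per group")
print(f"1-point effect: {n_1pt:,} per group")
print(f"ratio: {n_1pt / n_2pt:.2f}")
```

The ratio comes out slightly below 4 rather than exactly 4 because the variance terms also shift a little when the treatment rate changes.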

Interactive: Sample Size Calculator

Use this calculator to plan your A/B test. Enter your baseline conversion rate, desired minimum detectable effect, and see the required sample size.

📊Sample Size & Power Calculator for A/B Tests

Plan your A/B test by calculating the required sample size, statistical power, or minimum detectable effect. All calculations assume a two-tailed test for proportion differences.

Test Parameters

Baseline Conversion Rate: 10.0%

Significance Level (α): 0.05

Minimum Detectable Effect (absolute): 2.0% (20% relative)

Desired Power (1 - β): 80%

Required Sample Size

3,841 per group

7,682 total users

To detect a 2.0% absolute change (20% relative) with 80% power at α = 0.05

Power Curve: How Sample Size Affects Power
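The relationship behind a power curve can be sketched by inverting the sample size formula to get power as a function of the per-group sample size. This is a normal-approximation sketch under the same two-tailed assumptions; `power_two_proportions` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_a: float, p_b: float, n_per_group: int,
                          alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p_a + p_b) / 2
    sd_null = sqrt(2 * p_bar * (1 - p_bar))             # spread under H0
    sd_alt = sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))    # spread under H1
    # Solve the sample size formula for z_beta, then convert to a probability.
    z_beta = (abs(p_b - p_a) * sqrt(n_per_group) - z_alpha * sd_null) / sd_alt
    return NormalDist().cdf(z_beta)

# Power to detect 10% -> 12% at several per-group sample sizes.
for n in (500, 1000, 2000, 4000, 8000):
    print(f"n = {n:>5} per group -> power ~ {power_two_proportions(0.10, 0.12, n):.2f}")
```

If traffic caps the experiment below the required size, this kind of calculation shows how much power is being given up.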

💡 Key Insights for ML Engineers

• Smaller effects need more data: A 1% improvement needs ~4x more samples than a 2% improvement
• Industry standard: 80% power and α = 0.05 are conventional choices
• Traffic constraints: If you can't reach the required sample size, either accept lower power or focus on larger improvements
• Multiple metrics: Running tests on multiple metrics requires correction (Bonferroni, etc.) which increases sample size requirements
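To make the multiple-metrics point concrete, here is a minimal sketch of two standard corrections, Bonferroni and Holm, applied to a list of p-values. The p-values are invented for illustration, and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for metric i only if p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down procedure: never less powerful than Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values for three metrics tested in the same experiment.
p_values = [0.012, 0.020, 0.200]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

Both procedures control the family-wise error rate; Holm rejects at least as many hypotheses as Bonferroni, which is why it is often preferred when several metrics are tested together.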

Testing ML-Specific Metrics

ML systems often require testing multiple metrics simultaneously. A single "primary metric" doesn't capture all the effects of a model change.

Primary Metrics

The main metric you're trying to improve. The experiment decision is based on this.

• Click-through rate (CTR)
• Conversion rate
• Revenue per user
• Engagement (time spent, actions)

Guardrail Metrics

Metrics that must not regress, even if the primary metric improves.

• Latency / page load time
• Error rates
• User retention
• Long-term engagement

ML Model Metrics: When comparing ML models, you might also track:

Model-specific metrics (AUC, precision, recall) measured on live traffic
Prediction confidence/calibration
Feature coverage (does the new model handle more cases?)
Model latency (inference time)
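One simple way to combine a primary metric with guardrails is a ship rule: ship only if the primary metric improves significantly and no guardrail shows a significant regression. The sketch below assumes every metric is a proportion and reuses the two-proportion z-test; the metric names and counts are hypothetical, and production systems often use non-inferiority margins for guardrails instead of this plain significance check.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_a, n_a, x_b, n_b):
    """Return (difference, two-sided p-value) for a two-proportion z-test."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, 2 * (1 - NormalDist().cdf(abs(z)))

def decide(primary, guardrails, alpha=0.05):
    """primary: (x_a, n_a, x_b, n_b); guardrails: name -> (counts, higher_is_better)."""
    diff, p = z_test(*primary)
    if not (diff > 0 and p < alpha):
        return "no ship: primary metric did not improve significantly"
    for name, (counts, higher_is_better) in guardrails.items():
        g_diff, g_p = z_test(*counts)
        regressed = g_diff < 0 if higher_is_better else g_diff > 0
        if regressed and g_p < alpha:
            return f"no ship: guardrail '{name}' regressed"
    return "ship"

# Hypothetical experiment: conversion improves, but the error rate gets worse.
primary = (1200, 10_000, 1350, 10_000)
guardrails = {
    "7-day retention": ((4000, 10_000, 4010, 10_000), True),
    "error rate": ((100, 10_000, 150, 10_000), False),
}
print(decide(primary, guardrails))
```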

Interactive: Multi-Metric ML Testing

Test a new ML model across multiple metrics simultaneously. See how improvements in one metric might come with tradeoffs in others.

🎯ML Model A/B Testing: Multi-Metric Comparison

Compare ML models across multiple metrics. Set the "true" performance values (unknown in real life), then run the experiment to see how observed metrics converge and when statistical significance is reached.

Primary metric: AUC-ROC (higher is better; the test asks whether Model B improves it)

Model A (Baseline) true value: AUC-ROC 0.8200

Model B (Challenger) true value: AUC-ROC 0.8500

Significance Level (α): 0.05

The Peeking Problem

One of the most common mistakes in A/B testing is peeking—checking results multiple times during the experiment and stopping early when you see significance.

⚠️ Why Peeking is Dangerous

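When you check significance k times during an experiment and stop at the first significant result, the chance of at least one false positive is approximately bounded by:

P(at least one false positive) ≈ 1 - (1 - α)^k

This expression treats the k looks as independent tests. Successive looks at accumulating data share most of their observations, so the actual inflation is smaller than these figures, but it still climbs far above the nominal α. At α = 0.05:

Peeks (k)	Approx. false positive rate
1	5%
5	~23%
10	~40%
20	~64%

Under the same approximation, checking daily for a month (k = 30) with α = 0.05 gives a ~79% chance of a false positive!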
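The inflation can also be estimated directly by simulation. The sketch below runs A/A experiments (both groups share the same true rate, so every "significant" result is a false positive) and compares peeking every 100 users with a single look at the end. The function name, the default settings, and the seed are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=500, max_n=1000, peek_every=100,
                                rate=0.10, alpha=0.05, seed=0):
    """Share of A/A experiments that are declared significant at any peek."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        x_a = x_b = 0  # running conversion counts per group
        for n in range(1, max_n + 1):
            x_a += rng.random() < rate
            x_b += rng.random() < rate
            if n % peek_every == 0:
                pooled = (x_a + x_b) / (2 * n)
                se = sqrt(pooled * (1 - pooled) * 2 / n)
                # Stop at the first "significant" look, as a peeker would.
                if se > 0 and abs(x_b - x_a) / n / se > z_crit:
                    false_positives += 1
                    break
    return false_positives / n_sims

rate_peeking = peeking_false_positive_rate(peek_every=100)   # 10 looks
rate_single = peeking_false_positive_rate(peek_every=1000)   # 1 look at the end
print(f"False positive rate with 10 peeks: {rate_peeking:.1%}")
print(f"False positive rate with 1 look:   {rate_single:.1%}")
```

The simulated rate with 10 peeks typically lands well above 5% but below the ~40% given by the independence formula, because consecutive looks are strongly correlated.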

Interactive: Peeking Simulation

See the peeking problem in action. This simulation runs experiments where H₀ is true (both groups have the same rate), showing how repeated checking inflates false positives.

👀The Peeking Problem: How Repeated Checking Inflates False Positives

When H₀ is true (both variants have the same conversion rate), repeated checking at α = 0.05 dramatically increases the chance of seeing a "significant" result. This simulation demonstrates the real-world consequences of peeking.

Simulation Settings

Number of Simulations: 100

Significance Level (α): 0.05

Peeking Behavior

Peek Every N Users: 100

Max Sample Size: 1000

Total peeks per experiment: 10

Solutions to the Peeking Problem

1. Fixed Sample Size

Pre-register sample size and only analyze once at the end.

2. Sequential Testing

Use O'Brien-Fleming or Pocock boundaries that account for multiple looks.

3. Bayesian Methods

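Posterior probabilities keep their interpretation under continuous monitoring, so there is no p-value-style peeking penalty. Stopping the moment a threshold such as P(B > A) > 95% is crossed can still raise the frequentist false positive rate, so fix the decision rule in advance.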

4. α-Spending Functions

Distribute α budget across interim analyses (e.g., Lan-DeMets).
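As a rough illustration of α-spending, the sketch below evaluates the two classic Lan-DeMets spending functions, the O'Brien-Fleming type and the Pocock type, at four equally spaced looks. It shows only the cumulative α spent at each look; converting that into z-score boundaries requires numerical integration, which is not attempted here. The function names are hypothetical.

```python
from math import e, log, sqrt
from statistics import NormalDist

def obf_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """Lan-DeMets O'Brien-Fleming-type spending: 2 - 2*Phi(z_{alpha/2} / sqrt(t))."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))

def pocock_alpha_spent(t: float, alpha: float = 0.05) -> float:
    """Lan-DeMets Pocock-type spending: alpha * ln(1 + (e - 1) * t)."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * log(1 + (e - 1) * t)

# Cumulative alpha spent after each of four equally spaced looks.
for look in range(1, 5):
    t = look / 4  # fraction of the planned sample collected so far
    print(f"t = {t:.2f}: OBF-type {obf_alpha_spent(t):.5f}, "
          f"Pocock-type {pocock_alpha_spent(t):.5f}")
```

Both functions spend exactly α by the final look; the O'Brien-Fleming type spends almost nothing early, which keeps the final threshold close to the fixed-sample one, while the Pocock type spends more evenly.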

Bayesian A/B Testing

Bayesian A/B testing offers an alternative paradigm that many find more intuitive and practical for real-world ML applications.

Frequentist Approach

Answer: "Is the difference significant?" (Yes/No)
Output: p-value, confidence interval
Requires fixed sample size
Peeking inflates false positive rate
Cannot say "B is 90% likely to be better"

Bayesian Approach

Answer: "What's P(B > A)?" (probability)
Output: posterior distributions, credible intervals
Can stop anytime based on decision criteria
Natural handling of continuous monitoring
Directly answers business questions

Bayesian A/B Test with Beta Priors

Prior (uninformative):

p_A, p_B \sim \text{Beta}(1, 1) = \text{Uniform}(0, 1)

Posterior after data:

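p_A | \text{data} \sim \text{Beta}(1 + \text{successes}_A, 1 + \text{failures}_A)

p_B | \text{data} \sim \text{Beta}(1 + \text{successes}_B, 1 + \text{failures}_B)

Probability B is better (estimated via Monte Carlo sampling from the two posteriors):

P(p_B > p_A \mid \text{data}) = \int_0^1 \int_0^1 \mathbf{1}[p_B > p_A] \, \pi(p_A \mid \text{data}) \, \pi(p_B \mid \text{data}) \, dp_A \, dp_B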

Interactive: Bayesian A/B Testing

Watch Bayesian inference in action. As you collect more data, the posterior distributions become more concentrated, and you can see the probability that B outperforms A.

A/B Testing Simulator

Compare two variants using Bayesian inference with Beta distributions


Why Bayesian A/B Testing?

• Get a probability that B is better, not just "significant" or not
• Can stop early once a pre-specified decision criterion is met (no p-value-style peeking penalty, though the stopping rule should be fixed in advance)
• Can work with small sample sizes by incorporating prior information
• Naturally represents uncertainty: wider posterior curves mean more uncertainty

When to Use Bayesian A/B Testing:

When you want to continuously monitor results without peeking penalties
When stakeholders want probabilities rather than p-values
When you have prior information about likely effect sizes
For faster decisions when resources are limited

A/B Testing vs Multi-Armed Bandits

A/B testing and multi-armed bandits represent different philosophies for online experimentation:

Aspect	A/B Testing	Multi-Armed Bandits
Allocation	Fixed (50/50)	Adaptive (explore/exploit)
Goal	Statistical inference	Minimize regret
Traffic efficiency	Lower	Higher
Statistical rigor	High	Lower
Stopping rule	Fixed sample size	Continuous
Best for	Hypothesis testing	Optimization

Regret is the key concept in bandit algorithms—it measures the cumulative reward lost by not always showing the best arm:

\text{Regret}(T) = T \cdot \mu^* - \sum_{t=1}^{T} \mathbb{E}[r_t]

where $\mu^*$ is the mean reward of the best arm.
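The regret definition can be made concrete with a small simulation that compares fixed 50/50 allocation against Thompson Sampling with Beta(1, 1) priors. This is a sketch under assumed true rates of 8% and 12%; the function name `run_policy` and the seeds are hypothetical, and the regret tracked is the expected regret (using the true rates, which are unknown in a real experiment).

```python
import random

def run_policy(policy: str, rates, n_pulls: int = 1000, seed: int = 0) -> float:
    """Return cumulative expected regret for 'ab' (equal split) or 'thompson'."""
    rng = random.Random(seed)
    successes = [0] * len(rates)
    failures = [0] * len(rates)
    best = max(rates)
    regret = 0.0
    for t in range(n_pulls):
        if policy == "ab":
            arm = t % len(rates)  # fixed, equal allocation
        else:
            # Thompson Sampling: draw from each arm's Beta posterior, play the max.
            draws = [rng.betavariate(1 + s, 1 + f)
                     for s, f in zip(successes, failures)]
            arm = draws.index(max(draws))
        reward = rng.random() < rates[arm]
        successes[arm] += reward
        failures[arm] += not reward
        regret += best - rates[arm]  # expected reward lost on this pull
    return regret

rates = [0.08, 0.12]  # assumed true conversion rates for arms A and B
ab_regret = run_policy("ab", rates)
ts_regret = sum(run_policy("thompson", rates, seed=s) for s in range(50)) / 50
print(f"A/B (50/50) regret over 1000 pulls:       {ab_regret:.1f}")
print(f"Thompson Sampling regret (mean, 50 runs): {ts_regret:.1f}")
```

With equal allocation, half of all pulls go to the worse arm regardless of the data, so regret grows linearly; Thompson Sampling shifts traffic toward the better arm as evidence accumulates.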

Interactive: Bandits Comparison

Compare A/B testing with bandit algorithms (Thompson Sampling, UCB1, ε-Greedy). See how bandits allocate more traffic to the winning arm over time, reducing regret.

🎰A/B Testing vs Multi-Armed Bandits

Compare A/B testing (equal allocation) with adaptive strategies (Thompson Sampling, UCB1, ε-Greedy). Bandits minimize regret—the reward lost by not always choosing the best arm.

True Conversion Rates

Arm A: 8.0%

Arm B: 12.0%

Experiment Settings

Total Pulls: 1000

ε (exploration rate): 0.1

💡 When to Use Each Approach

Use A/B Testing When:

You need rigorous statistical inference
Regulatory requirements apply (e.g., medical)
You want interpretable p-values and CIs
Long-term effects matter (not just immediate reward)

Use Bandits When:

Minimizing regret (lost reward) is critical
You need faster convergence to the winner
Traffic is limited and every conversion counts
Continuous optimization (not one-time test)

When NOT to Use Bandits:

When you need rigorous p-values or confidence intervals
When regulatory requirements apply (medical, financial)
When delayed conversions make immediate feedback impossible
When you need to measure effect size precisely

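Practical Considerations

Topics to account for when running experiments in production:

• Novelty and Primacy Effects
• Network Effects and Interference
• Multiple Testing Corrections
• Sample Ratio Mismatch (SRM)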

Python Implementation

Let's implement a complete A/B testing analysis pipeline in Python:

Two-Proportion Z-Test for A/B Testing

🐍python

from scipy import stats
import numpy as np

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-test for A/B testing."""

    # Conversion rates: successes divided by total visitors for each group
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    # Pooled conversion rate under H0 (the groups are identical)
    p_pooled = (successes_a + successes_b) / (n_a + n_b)

    # Standard error of the difference under H0
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))

    # z-statistic: how many standard errors the observed difference is from zero
    z = (p_b - p_a) / se

    # Two-tailed p-value: probability of a difference at least this extreme
    # if H0 is true. norm.sf(x) equals 1 - cdf(x) but stays accurate for large |z|.
    p_value = 2 * stats.norm.sf(abs(z))

    # 95% confidence interval for the difference (unpooled standard error)
    se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    ci = (p_b - p_a - 1.96*se_diff, p_b - p_a + 1.96*se_diff)

    # Relative lift: percentage improvement of treatment over control
    lift = (p_b - p_a) / p_a * 100 if p_a > 0 else np.inf

    return {
        'p_a': p_a, 'p_b': p_b,
        'z_statistic': z, 'p_value': p_value,
        'ci_95': ci, 'lift_percent': lift
    }

Now let's add power analysis, Bayesian analysis, sequential boundaries, and a combined report:

🐍python

import numpy as np
from scipy import stats

# ================================================
# 1. Sample Size Calculation
# ================================================

def sample_size_for_proportion(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    Calculate sample size per group for A/B test on proportions.

    Parameters
    ----------
    baseline_rate : float
        Expected conversion rate of control group
    mde : float
        Minimum detectable effect (absolute difference)
    alpha : float
        Significance level (Type I error rate)
    power : float
        Desired power (1 - Type II error rate)

    Returns
    -------
    int : Required sample size per group
    """
    if mde == 0:
        raise ValueError("mde must be non-zero")

    p1 = baseline_rate
    p2 = baseline_rate + mde

    z_alpha = stats.norm.ppf(1 - alpha/2)  # two-tailed critical value
    z_beta = stats.norm.ppf(power)         # quantile for the desired power

    p_bar = (p1 + p2) / 2

    n = (z_alpha * np.sqrt(2*p_bar*(1-p_bar)) +
         z_beta * np.sqrt(p1*(1-p1) + p2*(1-p2)))**2 / (p2 - p1)**2

    return int(np.ceil(n))

# Example
n = sample_size_for_proportion(baseline_rate=0.10, mde=0.02)
print(f"Required sample size per group: {n:,}")


# ================================================
# 2. Bayesian A/B Testing
# ================================================

def bayesian_ab_test(successes_a, n_a, successes_b, n_b,
                     prior_alpha=1, prior_beta=1, n_samples=100000, seed=None):
    """
    Bayesian A/B test using Beta-Binomial model.

    Returns probability that B is better than A, plus posterior summaries.
    Pass a seed to make the Monte Carlo estimates reproducible.
    """
    rng = np.random.default_rng(seed)

    # Posterior distributions (conjugate update)
    alpha_a = prior_alpha + successes_a
    beta_a = prior_beta + (n_a - successes_a)

    alpha_b = prior_alpha + successes_b
    beta_b = prior_beta + (n_b - successes_b)

    # Monte Carlo sampling from both posteriors
    samples_a = rng.beta(alpha_a, beta_a, n_samples)
    samples_b = rng.beta(alpha_b, beta_b, n_samples)

    # P(B > A): share of draws in which B's rate exceeds A's
    prob_b_better = np.mean(samples_b > samples_a)

    # Expected relative lift of B over A, in percent
    expected_lift = np.mean((samples_b - samples_a) / samples_a) * 100

    # 95% credible intervals
    ci_a = np.percentile(samples_a, [2.5, 97.5])
    ci_b = np.percentile(samples_b, [2.5, 97.5])
    ci_diff = np.percentile(samples_b - samples_a, [2.5, 97.5])

    return {
        'prob_b_better': prob_b_better,
        'expected_lift_percent': expected_lift,
        'posterior_mean_a': samples_a.mean(),
        'posterior_mean_b': samples_b.mean(),
        'ci_95_a': ci_a,
        'ci_95_b': ci_b,
        'ci_95_diff': ci_diff
    }

# Example
result = bayesian_ab_test(
    successes_a=120, n_a=1000,  # 12% conversion
    successes_b=145, n_b=1000   # 14.5% conversion
)
print(f"P(B > A) = {result['prob_b_better']:.1%}")
print(f"Expected lift: {result['expected_lift_percent']:.1f}%")


# ================================================
# 3. Sequential Testing (O'Brien-Fleming)
# ================================================

def obrien_fleming_boundary(alpha, n_looks, look_number):
    """
    Approximate O'Brien-Fleming boundary for sequential testing.

    Returns z-score threshold for rejecting H0 at this look.
    """
    if not 1 <= look_number <= n_looks:
        raise ValueError("look_number must be between 1 and n_looks")

    t = look_number / n_looks  # Information fraction

    # O'Brien-Fleming boundaries have the shape c / sqrt(t). Using the
    # fixed-sample critical value for c is an approximation: the exact
    # constant is slightly larger and is found numerically, so these
    # thresholds control the overall alpha only approximately.
    z_alpha_full = stats.norm.ppf(1 - alpha/2)
    z_boundary = z_alpha_full / np.sqrt(t)

    return z_boundary

# Example: 3 interim looks + 1 final
for look in [1, 2, 3, 4]:
    boundary = obrien_fleming_boundary(alpha=0.05, n_looks=4, look_number=look)
    print(f"Look {look}: |z| > {boundary:.3f} to reject H0")


# ================================================
# 4. Complete A/B Test Analysis
# ================================================

def analyze_ab_test(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Complete A/B test analysis with both frequentist and Bayesian results."""

    # Frequentist analysis: pooled z-test for the p-value
    p_a = successes_a / n_a
    p_b = successes_b / n_b

    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))

    # Confidence interval for the difference (unpooled standard error)
    se_diff = np.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci = (p_b - p_a - z_crit*se_diff, p_b - p_a + z_crit*se_diff)

    # Bayesian analysis
    bayes = bayesian_ab_test(successes_a, n_a, successes_b, n_b)

    # Print report
    print("=" * 60)
    print("A/B TEST ANALYSIS REPORT")
    print("=" * 60)
    print(f"\nSample Sizes: Control = {n_a:,}, Treatment = {n_b:,}")
    print(f"Conversions:  Control = {successes_a:,}, Treatment = {successes_b:,}")
    print(f"\nConversion Rates:")
    print(f"  Control:   {p_a*100:.2f}%")
    print(f"  Treatment: {p_b*100:.2f}%")
    print(f"  Lift:      {(p_b-p_a)/p_a*100:+.2f}%")
    print(f"\nFrequentist Results:")
    print(f"  z-statistic: {z:.3f}")
    print(f"  p-value:     {p_value:.4f}")
    print(f"  {(1-alpha)*100:.0f}% CI:      [{ci[0]*100:.2f}%, {ci[1]*100:.2f}%]")
    print(f"  Significant: {'Yes' if p_value < alpha else 'No'} (α = {alpha})")
    print(f"\nBayesian Results:")
    print(f"  P(B > A):    {bayes['prob_b_better']*100:.1f}%")
    print(f"  95% Credible Interval: [{bayes['ci_95_diff'][0]*100:.2f}%, {bayes['ci_95_diff'][1]*100:.2f}%]")

    # Decision rule based on the posterior probability that B beats A
    if bayes['prob_b_better'] > 0.95:
        print("\n✅ RECOMMENDATION: Ship treatment (high confidence B is better)")
    elif bayes['prob_b_better'] < 0.05:
        print("\n❌ RECOMMENDATION: Keep control (high confidence A is better)")
    else:
        print("\n⏳ RECOMMENDATION: Continue collecting data")

# Example usage
analyze_ab_test(
    successes_a=1200, n_a=10000,  # 12% baseline
    successes_b=1350, n_b=10000   # 13.5% treatment
)

Knowledge Check

Test your understanding of A/B testing for ML applications with this quiz.

A company runs an A/B test comparing two recommendation models. After 1 week, the p-value is 0.08. After 2 weeks, it drops to 0.03. What is the correct interpretation?

Summary

Key Takeaways

A/B testing is essential for ML deployment: Offline metrics don't always predict online performance. Randomized experiments provide causal evidence.
Power analysis is critical: Calculate sample size before starting. Required sample size scales inversely with the square of the effect size.
Beware the peeking problem: Repeated checking at α = 0.05 inflates false positive rates dramatically. Use sequential testing or Bayesian methods.
Multiple metrics need multiple considerations: Define primary and guardrail metrics. Consider corrections for multiple testing.
Bayesian A/B testing offers practical advantages: Continuous monitoring, probability statements, and faster decisions—but requires different interpretation.
Bandits optimize differently than A/B tests: Multi-armed bandits minimize regret but sacrifice statistical rigor. Choose based on your goals.

Quick Reference

Topic	Key Formula / Rule
Z-test statistic	z = (p_B - p_A) / √[p̂(1-p̂)(1/n_A + 1/n_B)]
Sample size (approx)	n ≈ 16σ²/δ² for 80% power, α=0.05
False positive with k peeks	P(FP) ≈ 1 - (1-α)^k
Bayesian posterior	p | data ~ Beta(1 + successes, 1 + failures)
Regret	T·μ* - Σ E[r_t]

Industry Best Practices

Before the Test

• Define hypothesis and success criteria
• Calculate required sample size
• Pre-register analysis plan
• Set up proper logging and metrics

During and After

• Check for sample ratio mismatch
• Don't peek without correction
• Run long enough for novelty effects
• Document and share learnings
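The sample ratio mismatch check in the list above can be sketched as a chi-square goodness-of-fit test on the group sizes. The version below uses only the standard library, relying on the fact that a chi-square variable with one degree of freedom is a squared standard normal; the function name and the user counts are hypothetical, and the strict 0.001 threshold is a common convention rather than a fixed rule.

```python
from math import erfc, sqrt

def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square (1 df) p-value for the observed split vs. the planned split."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom: P(chi2_1 > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2)).
    return erfc(sqrt(chi2 / 2))

# Hypothetical experiment planned as a 50/50 split.
for n_c, n_t in [(50_000, 50_100), (50_000, 51_100)]:
    p = srm_p_value(n_c, n_t)
    flag = "possible SRM, investigate before trusting results" if p < 0.001 else "ok"
    print(f"control={n_c:,} treatment={n_t:,} -> p = {p:.4f} ({flag})")
```

A very small p-value here says nothing about which model is better; it signals that assignment or logging is likely broken, so the metric comparison should not be trusted until the cause is found.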

Looking Ahead: In the next section, we'll explore Sequential Testing, which provides a principled framework for monitoring experiments over time while maintaining proper error control.
