
A/B testing under the hood: what the platform isn't telling you

The underlying idea

You open your A/B testing dashboard. The experiment has been running for five days. Variant B is showing a 6% lift in conversion. The p-value is 0.03. There’s a green badge that says “Significant.” You click “Declare Winner” and ship Variant B.

You just made at least three statistical mistakes.

A/B testing platforms are built for speed, not statistical hygiene. They show you numbers in real time, encourage early stopping, test multiple metrics simultaneously, and make it easy to declare significance before your predetermined sample size is reached. Every one of these behaviors inflates your false positive rate.

This is not a criticism of A/B testing. It’s the most rigorous decision-making tool in the product analyst’s toolkit when used correctly. The problem is that “using it correctly” requires understanding the statistical machinery underneath the UI. The platform won’t remind you to pre-register your hypothesis, calculate your required sample size, or account for multiple comparisons. It just shows you numbers.

This post connects every concept from Phase 2 of the Statbitall curriculum into a single, practical framework for running experiments that produce reliable conclusions.

Historical root

A/B testing as a business practice traces back to direct mail marketing in the early 20th century, where advertisers would send two versions of a mailer to different halves of a mailing list and compare response rates. The statistical foundations come from Fisher’s randomized experiments in agriculture (1920s) and Neyman-Pearson hypothesis testing (1933).

The modern form emerged at internet companies: Amazon, Google, and Microsoft all built internal A/B testing infrastructure in the early 2000s. Kohavi, Longbotham, Sommerfield, and Henne's 2009 paper "Controlled Experiments on the Web" became a landmark document for how technology companies should run experiments at scale. It introduced the concept of the overall evaluation criterion, the importance of pre-experiment power calculations, and the dangers of peeking.

The proliferation of SaaS A/B testing tools in the 2010s democratized experimentation but also spread the statistical mistakes. Tools optimized for ease of use removed the friction that forced analysts to think carefully about their experimental design.

Key assumptions

Random assignment. Users must be randomly assigned to control and treatment. If users self-select into a variant, the comparison is invalid. The treatment group is a different population from the control group.

One experiment at a time per user. If a user is in multiple simultaneous experiments, the treatments may interact. Experiment A changes the page layout while Experiment B changes the button color. The observed effects are confounded.

Stable unit treatment value assumption (SUTVA). The outcome for one user is unaffected by which variant another user received. This breaks down in social networks, marketplace platforms, and referral programs where users influence each other.

Pre-specified sample size. You must decide how many users to include before the experiment starts. Stopping when you hit significance (peeking) inflates the false positive rate substantially. After 10 looks at the data, your actual false positive rate at nominal alpha = 0.05 is closer to 19%.

Pre-specified primary metric. You must decide which metric is the primary outcome before the experiment starts. Testing 10 metrics and reporting the one that is significant gives you a family-wise false positive rate far above 5%.

The math

The two-sample z-test for proportions is the most common test in A/B testing:

z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}

where \hat{p}_A and \hat{p}_B are the observed conversion rates, n_A and n_B are the sample sizes, and \hat{p} is the pooled conversion rate.
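
A quick worked example with hypothetical numbers: with 5,000 users per variant, 500 conversions in control (10.0%) and 550 in treatment (11.0%), the pooled rate is 0.105, the standard error is sqrt(0.105 × 0.895 × (2/5000)) ≈ 0.0061, and z = (0.110 − 0.100) / 0.0061 ≈ 1.63, giving a two-sided p-value of about 0.10. Even a full percentage-point lift is not significant at this sample size.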

Required sample size per variant:

n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot \left[\hat{p}_A(1-\hat{p}_A) + \hat{p}_B(1-\hat{p}_B)\right]}{\delta^2}

where \delta is the minimum detectable effect and z_{\beta} corresponds to your target power (0.84 for 80% power).
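
Plugging in the values used in the code below — a 10% baseline, a 2-percentage-point MDE, α = 0.05, and 80% power — gives n = (1.96 + 0.84)² × (0.09 + 0.1056) / 0.02² ≈ 3,800 users per variant, or roughly 7,700 users for the whole experiment.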

The peeking problem quantified. If you peek at the data k times and stop as soon as p < 0.05, your actual false positive rate follows the Pocock correction approximately as:

\alpha_{\text{actual}} \approx \alpha \cdot (1 + \ln k)

At k = 5 peeks: the actual rate is roughly 0.05 × (1 + 1.61) = 0.13. At k = 20 peeks: roughly 0.20. The more often you look, the higher the chance of a false positive.
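
Tabulating that approximation directly (this just evaluates the formula; the simulation in the code section below checks the inflation empirically):

import numpy as np

for k in [1, 2, 5, 10, 20]:
    # Pocock-style approximation of the cumulative false positive rate after k peeks
    print(f"{k:>2} peeks -> approximate false positive rate {0.05 * (1 + np.log(k)):.2f}")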

Sequential testing solves the peeking problem by adjusting the significance threshold at each look. The sequential probability ratio test (SPRT) maintains exact Type I error control regardless of when you stop, at the cost of a larger expected sample size.
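
As a sketch of the idea — a one-sample SPRT against a fixed baseline with hypothetical rates, whereas platforms typically use two-sample or mixture variants — the test accumulates a log-likelihood ratio per user and stops as soon as it crosses a boundary:

import numpy as np

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Classic one-sample SPRT: H0 rate = p0 vs H1 rate = p1."""
    upper = np.log((1 - beta) / alpha)   # cross this -> accept H1
    lower = np.log(beta / (1 - alpha))   # cross this -> accept H0
    llr = 0.0
    for i, x in enumerate(observations, start=1):
        llr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", i
        if llr <= lower:
            return "accept H0", i
    return "continue", len(observations)

rng = np.random.default_rng(0)
# Hypothetical stream where the true rate really is 12% against a 10% baseline
decision, n_used = sprt_bernoulli(rng.binomial(1, 0.12, 20_000), p0=0.10, p1=0.12)
print(decision, n_used)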

The code

[Figure: three panels showing peeking inflation, sample size vs. effect size, and the multiple-metrics family-wise error rate]
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def required_sample_size(p_baseline, mde, alpha=0.05, power=0.80):
    p_treatment = p_baseline + mde
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    var_baseline = p_baseline * (1 - p_baseline)
    var_treatment = p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta)**2 * (var_baseline + var_treatment) / mde**2
    return int(np.ceil(n))

def run_ab_test(n_a, n_b, conv_a, conv_b, alpha=0.05):
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p_pooled = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return p_a, p_b, z, p_value

# --- Step 1: Calculate required sample size ---
p_baseline = 0.10
mde = 0.02
n_required = required_sample_size(p_baseline, mde)
print(f"Required sample size: {n_required:,} per variant")
print(f"Total experiment size: {2*n_required:,} users")

# --- Step 2: Simulate peeking vs fixed sample ---
n_total = n_required
daily_users = 100
false_positives_peeking = 0
false_positives_fixed = 0
n_simulations = 1000

for _ in range(n_simulations):
    # Null is TRUE: both groups have same conversion rate
    conv_a = rng.binomial(1, p_baseline, n_total)
    conv_b = rng.binomial(1, p_baseline, n_total)

    # Fixed sample: test once at the end
    _, _, _, p_fixed = run_ab_test(
        n_total, n_total, conv_a.sum(), conv_b.sum()
    )
    if p_fixed < 0.05:
        false_positives_fixed += 1

    # Peeking: test at every 100-user increment
    declared = False
    for day in range(1, n_total // daily_users + 1):
        n_so_far = day * daily_users
        _, _, _, p_peek = run_ab_test(
            n_so_far, n_so_far,
            conv_a[:n_so_far].sum(), conv_b[:n_so_far].sum()
        )
        if p_peek < 0.05:
            declared = True
            break
    if declared:
        false_positives_peeking += 1

print(f"\nSimulation results ({n_simulations} experiments, null TRUE):")
print(f"  Fixed sample FPR: {false_positives_fixed/n_simulations:.3f} (target 0.05)")
print(f"  Peeking FPR:      {false_positives_peeking/n_simulations:.3f} (inflated)")

# --- Step 3: Multiple metrics ---
n_metrics = 8
alpha_bonferroni = 0.05 / n_metrics
fwer_uncorrected = 1 - (0.95 ** n_metrics)
print(f"\nWith {n_metrics} metrics at alpha=0.05:")
print(f"  Family-wise error rate: {fwer_uncorrected:.3f}")
print(f"  Bonferroni threshold: {alpha_bonferroni:.4f} per metric")

The left panel shows that peeking nearly doubles the false positive rate even with relatively infrequent checks. The simulation makes the inflation concrete: with the null true, fixed-sample testing produces about 5% false positives while peeking produces 15-20%.

The middle panel shows why small effects are expensive to detect. A 0.5% lift requires roughly sixteen times the data of a 2% lift, because required sample size scales with the inverse square of the effect size. Setting the MDE too small is how teams end up running experiments for months.

The right panel shows multiple metrics inflating the family-wise error rate. Testing 8 metrics at the standard threshold gives you a 34% chance of at least one false positive. Bonferroni correction controls this at the cost of requiring stronger evidence per metric.

Business application

The pre-experiment checklist. Before starting any A/B test: state the primary metric in writing (secondary metrics are for exploration only); calculate the required sample size based on your baseline conversion, MDE, and desired power; write down the date you expect to reach that size; do not look at results until that date. If you must monitor for harm (a variant crashing the site), use a separate guardrail metric with Bonferroni correction.
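
A minimal sketch of what that checklist looks like in code, reusing required_sample_size from the code section; the daily traffic figure, guardrail count, and metric name are hypothetical:

import numpy as np
from datetime import date, timedelta

daily_traffic = 2_000   # hypothetical users entering the experiment per day
n_guardrails = 3        # hypothetical number of guardrail metrics

n_per_variant = required_sample_size(p_baseline=0.10, mde=0.02)
days_needed = int(np.ceil(2 * n_per_variant / daily_traffic))

print("Primary metric: checkout conversion (pre-registered)")
print(f"Run for ~{days_needed} days; read results on {date.today() + timedelta(days=days_needed)}")
print(f"Guardrail threshold (Bonferroni over {n_guardrails} metrics): {0.05 / n_guardrails:.4f}")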

When to use sequential testing. If your business genuinely cannot afford to wait for a fixed sample (high-velocity products, safety-critical features), use sequential testing methods like SPRT or always-valid confidence intervals. These allow early stopping with controlled error rates. Most major experimentation platforms now support sequential testing, though it is often not the default.

The minimum detectable effect is a business decision. Before calculating sample size, ask: what is the smallest lift that would actually change a decision? A 0.1% conversion lift may take a million or more users per variant to detect and still be economically irrelevant. A 5% lift might be business-transforming yet require only a small fraction of that traffic to detect. The MDE should reflect what matters, not what is easy to detect.
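
To make the trade-off concrete, the required_sample_size function from the code section can tabulate the cost of several MDEs at a 10% baseline (illustrative values only):

for mde in [0.001, 0.005, 0.01, 0.02, 0.05]:
    # Smaller minimum detectable effects demand dramatically more users
    print(f"MDE {mde:.1%}: {required_sample_size(0.10, mde):>9,} users per variant")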

The no significant effect trap. A non-significant result from an underpowered experiment is not evidence that the variant does not work. It is an inconclusive result from a test that could not tell. Before concluding “no effect,” check your post-hoc power. If power was below 80% for the effect size you care about, your experiment was inconclusive.
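
A rough way to run that check, using the normal approximation for the two-proportion test; the stopping point of 1,500 users per variant is hypothetical:

import numpy as np
from scipy import stats

def approx_power(n_per_variant, p_baseline, mde, alpha=0.05):
    """Approximate power of the two-sample z-test to detect an absolute lift of mde."""
    p_treat = p_baseline + mde
    se = np.sqrt((p_baseline * (1 - p_baseline) + p_treat * (1 - p_treat)) / n_per_variant)
    return stats.norm.cdf(mde / se - stats.norm.ppf(1 - alpha / 2))

# Hypothetical: the experiment stopped early at 1,500 users per variant
print(f"Power at the 2-point MDE: {approx_power(1_500, 0.10, 0.02):.0%}")
# Well below 80%, so a non-significant result here is inconclusive, not "no effect"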

Network effects and SUTVA violations. Social platforms, marketplaces, and two-sided markets often violate the stable unit treatment value assumption. Cluster randomization (assigning whole communities, not individuals) and switchback experiments (alternating treatments over time periods) are designed for these cases.
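
A minimal sketch of cluster randomization: hash the cluster identifier (a city, a seller, a friend group) instead of the user ID, so everyone in a cluster gets the same variant and interference stays inside the cluster. The salt and cluster labels below are hypothetical:

import hashlib

def assign_cluster_variant(cluster_id: str, salt: str = "exp_2024_checkout") -> str:
    """Deterministically assign an entire cluster to a single variant."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# All users in the same city see the same variant
print(assign_cluster_variant("city:austin"), assign_cluster_variant("city:lagos"))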


Pius Oyedepo

Statistician and data analyst. Writing about the math behind the models.