Confidence intervals don't mean what you think they mean
The underlying idea
A polling company reports that 52% of likely voters support a candidate, with a margin of error of plus or minus 3 percentage points. Most people read this as “the true support is somewhere between 49% and 55%.” That interpretation is close enough for casual conversation. But it’s technically wrong, and the distinction matters when you’re making decisions based on statistical estimates.
A confidence interval is a range computed from sample data that, if the sampling process were repeated many times, would contain the true population value a specified percentage of the time. A 95% confidence interval doesn’t mean there’s a 95% probability that the true value falls in this particular interval. It means the procedure that generated the interval captures the true value 95% of the time across many repetitions.
This sounds like a pointless distinction. It isn’t. The true population value is a fixed number. It’s either inside this interval or it isn’t. There’s no probability about it. The probability statement is about the method, not about any single interval.
Why does this matter in practice? Because people treat confidence intervals as measures of certainty about their specific result. A product manager who sees a 95% CI of [2%, 8%] for conversion lift might say “I’m 95% sure the lift is between 2% and 8%.” That’s not what the interval claims. The interval claims that the procedure is reliable, not that this particular output is correct.
Historical root
Jerzy Neyman introduced confidence intervals in 1937, building on ideas from Ronald Fisher. The two men had different philosophies and spent decades arguing about the correct interpretation of statistical inference.
Fisher had developed the concept of “fiducial intervals” in 1930, which he intended as statements about the probability that a parameter lies in a given range. Neyman rejected this interpretation. He argued that the parameter is fixed (not random), and that probability statements should only apply to the procedure, not to the parameter.
Neyman’s formulation won out in mainstream statistics, but the philosophical confusion never fully resolved. Textbooks still routinely present confidence intervals with language that sounds more like Fisher’s interpretation than Neyman’s. The result is that almost everyone who uses confidence intervals misinterprets them, including many practicing statisticians.
The Bayesian alternative, the credible interval, does allow the statement “there is a 95% probability the parameter lies in this range.” But it requires specifying a prior distribution, which introduces its own set of decisions and assumptions. That tradeoff is covered in a later post on Bayesian vs frequentist thinking.
Key assumptions
Random sampling. The confidence interval formula assumes the data was collected through a random sampling process. If the sample is biased (as discussed in the sampling post), the interval will be centered on the wrong value. A narrow confidence interval from a biased sample is worse than a wide one from a representative sample.
Known or estimable variability. The interval width depends on the standard error, which in turn depends on the variance of the data. If the variance estimate is unreliable (small sample from a heavy-tailed distribution), the interval width will be wrong. With small samples, you should use the t-distribution instead of the normal to compute wider, more honest intervals.
Independence of observations. The standard error formula assumes observations are independent. If they’re correlated (time series data, clustered data, repeated measures on the same person), the true standard error is larger than the formula suggests, and the confidence interval is too narrow. Your stated 95% interval might actually be an 80% interval, as the simulation sketch after these assumptions shows.
Normality of the sampling distribution. For means and proportions with moderate to large samples, the Central Limit Theorem guarantees approximate normality. With small samples, especially from skewed populations, the normal approximation breaks down. For proportions near 0 or 1 with small $n$, use the Wilson interval or Clopper-Pearson interval instead of the standard Wald interval.
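The independence failure is easy to demonstrate. Here is a minimal sketch (the AR(1) correlation strength phi = 0.5, the sample size, and the simulation count are illustrative choices, not from any real dataset): it builds nominal 95% t-intervals using the naive independence-based standard error on autocorrelated data, then measures how often they actually capture the true mean.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

true_mean = 0.0
n = 50
phi = 0.5       # AR(1) correlation between consecutive observations (illustrative)
n_sims = 5000
t_crit = stats.t.ppf(0.975, df=n - 1)

captured = 0
for _ in range(n_sims):
    # Build an AR(1) series: each observation is correlated with the previous one
    noise = rng.normal(size=n)
    x = np.empty(n)
    x[0] = noise[0]
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    x += true_mean
    # Naive standard error, which wrongly assumes independence
    se = x.std(ddof=1) / np.sqrt(n)
    half = t_crit * se
    if x.mean() - half <= true_mean <= x.mean() + half:
        captured += 1

print(f"Nominal coverage: 95%, actual coverage: {captured / n_sims:.1%}")
# With phi = 0.5, actual coverage lands around 75% in this setup: the
# stated 95% interval is far too narrow for correlated data.

The stronger the correlation, the worse the shortfall; phi is the knob to turn if you want to watch the coverage degrade further.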
The math
The general form of a confidence interval is:

$$\text{estimate} \pm \text{critical value} \times \text{standard error}$$
For a population mean with known variance:

$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$

where $z_{\alpha/2}$ is the critical value from the standard normal distribution. For a 95% interval, $z_{0.025} \approx 1.96$.
In practice, you don’t know $\sigma$ and must estimate it from the sample. This changes the distribution from normal to Student’s t:

$$\bar{x} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$$

The t-distribution has heavier tails than the normal, producing wider intervals that account for the uncertainty in estimating $\sigma$. With large $n$, the t-distribution converges to the normal, and the distinction vanishes.
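A quick way to see the size of that correction is to compare critical values directly; scipy exposes both distributions (the sample sizes below are arbitrary illustrations):

from scipy import stats

z_crit = stats.norm.ppf(0.975)   # 1.96 for a 95% interval
print(f"Normal critical value: {z_crit:.3f}")

# t critical values shrink toward the normal's 1.96 as the sample grows
for n in [5, 10, 30, 100, 1000]:
    t_crit = stats.t.ppf(0.975, df=n - 1)
    print(f"n = {n:>4}: t = {t_crit:.3f} ({t_crit / z_crit - 1:+.1%} wider interval)")

At $n = 5$ the t-based interval is over 40% wider than the normal one; by $n = 100$ the difference is about 1%.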
For a proportion:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

This is the Wald interval. It works well when $n\hat{p}$ and $n(1 - \hat{p})$ are both at least 10. When they aren’t (rare events, small samples), the Wilson interval is more reliable:

$$\frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \pm \frac{z}{1 + \frac{z^2}{n}} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n} + \frac{z^2}{4n^2}}$$

where $z = z_{\alpha/2}$. The Wilson interval doesn’t collapse to zero width when $\hat{p} = 0$ or 1, which is a known failure mode of the Wald interval.
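Both formulas are a few lines of numpy. This is a minimal sketch at the 95% level; the helper names and the example counts (3 of 40, then 0 of 40) are mine, chosen to trigger the small-sample failure modes:

import numpy as np
from scipy import stats

def wald_interval(successes, n, level=0.95):
    z = stats.norm.ppf(0.5 + level / 2)
    p_hat = successes / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(successes, n, level=0.95):
    z = stats.norm.ppf(0.5 + level / 2)
    p_hat = successes / n
    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Rare events, small sample: the Wald interval misbehaves twice over
for k, n in [(3, 40), (0, 40)]:
    w_lo, w_hi = wald_interval(k, n)
    ws_lo, ws_hi = wilson_interval(k, n)
    print(f"{k}/{n}:  Wald [{w_lo:.3f}, {w_hi:.3f}]   Wilson [{ws_lo:.3f}, {ws_hi:.3f}]")
# At 3/40 the Wald lower bound goes negative (an impossible proportion);
# at 0/40 the Wald interval collapses to [0, 0] while Wilson stays [0, 0.088].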
The code
This script demonstrates what “95% confidence” actually means by simulating many confidence intervals and counting how many capture the true value.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
rng = np.random.default_rng(42)
# True population
true_mean = 100
true_sd = 15
n = 30
# --- Simulation: what does "95% confidence" really mean? ---
n_intervals = 100
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Generate 100 confidence intervals
captured = 0
for i in range(n_intervals):
    sample = rng.normal(loc=true_mean, scale=true_sd, size=n)
    x_bar = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.975, df=n-1)
    ci_low = x_bar - t_crit * se
    ci_high = x_bar + t_crit * se
    contains = ci_low <= true_mean <= ci_high
    if contains:
        captured += 1
    color = "steelblue" if contains else "coral"
    axes[0].plot([ci_low, ci_high], [i, i], color=color, linewidth=1.5)
    axes[0].plot(x_bar, i, 'o', color=color, markersize=3)
axes[0].axvline(x=true_mean, color="black", linestyle="--", linewidth=1.5,
                label=f"True mean = {true_mean}")
axes[0].set_xlabel("Value")
axes[0].set_ylabel("Sample number")
axes[0].set_title(f"100 confidence intervals ({captured} captured the true mean)")
axes[0].legend(fontsize=9)
# --- Effect of sample size on interval width ---
sample_sizes = [10, 20, 50, 100, 250, 500, 1000]
widths = []
for n_size in sample_sizes:
    sample = rng.normal(loc=true_mean, scale=true_sd, size=n_size)
    se = sample.std(ddof=1) / np.sqrt(n_size)
    t_crit = stats.t.ppf(0.975, df=n_size-1)
    width = 2 * t_crit * se
    widths.append(width)
axes[1].plot(sample_sizes, widths, 'o-', color="steelblue", linewidth=2)
axes[1].set_xlabel("Sample size")
axes[1].set_ylabel("95% CI width")
axes[1].set_title("Larger samples produce narrower intervals")
axes[1].set_xscale("log")
plt.tight_layout()
plt.savefig("confidence_intervals.png", dpi=150)
plt.show()
print(f"Intervals containing true mean: {captured}/100")
print(f"CI width at n=30: {widths[2]:.1f}")
print(f"CI width at n=1000: {widths[-1]:.1f}")
The left panel shows 100 confidence intervals computed from 100 different random samples. Blue intervals contain the true mean; coral ones miss it. Roughly 95 out of 100 will be blue. That’s what “95% confidence” means: not that any single interval has a 95% chance of being right, but that the method produces correct intervals about 95% of the time.
The right panel shows how interval width shrinks with sample size. At $n = 10$, the interval is wide (low precision). At $n = 1000$, it’s narrow (high precision). The square root relationship from the standard error formula ($\mathrm{SE} = s/\sqrt{n}$) is visible: you need to quadruple the sample size to halve the width.
Business application
Reporting A/B test results. When reporting that a new feature increased conversion by 3% with a 95% CI of [1%, 5%], you’re saying the method is reliable, not that you’re 95% certain the true lift is between 1% and 5%. The practical consequence: among tests where the feature truly has no effect, about 5% will still produce a CI that excludes zero. Across a year of weekly A/B tests, that’s two or three wrong decisions.
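A back-of-the-envelope simulation of that arithmetic, assuming 52 weekly A/A-style tests with no true effect, 5,000 users per arm, and a 4% baseline conversion rate (all illustrative numbers):

import numpy as np

rng = np.random.default_rng(7)

n_tests = 52        # one test per week (illustrative)
n_users = 5000      # users per arm (illustrative)
base_rate = 0.04    # true conversion rate in BOTH arms: no real effect exists
z = 1.96

false_positives = 0
for _ in range(n_tests):
    # Observed conversion rates in control and treatment
    a = rng.binomial(n_users, base_rate) / n_users
    b = rng.binomial(n_users, base_rate) / n_users
    se = np.sqrt(a * (1 - a) / n_users + b * (1 - b) / n_users)
    if abs(b - a) > z * se:   # 95% CI for the lift excludes zero
        false_positives += 1

print(f"CIs excluding zero despite no true effect: {false_positives}/{n_tests}")
# In expectation about 0.05 * 52 ≈ 2.6 per year, purely from sampling noise.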
Financial forecasting. Revenue forecasts should always include an interval, not just a point estimate: say “we expect $10M in Q4, with a 90% interval of $8M to $12M.” The interval communicates the uncertainty in the forecast. Without it, stakeholders treat the point estimate as a certainty and react badly when reality deviates.
Clinical trials. Drug trials report confidence intervals for treatment effects. If the CI for the difference in recovery rates between drug and placebo is [2%, 12%], the drug probably works but the precision is low. If the CI is [7%, 9%], the drug probably works and the precision is high. The width of the interval matters as much as whether it excludes zero.
Common misinterpretations to avoid. “There is a 95% probability the true mean is in this interval” is wrong (the true mean is fixed, not random). “95% of the data falls in this interval” is wrong (that’s a prediction interval, not a confidence interval). “If we repeated the study, the result would be in this interval 95% of the time” is wrong (each study would produce a different interval). The correct statement: “If we repeated the sampling process many times, 95% of the resulting intervals would contain the true parameter.” It’s less satisfying. It’s also the only one that’s true.
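To make the second misinterpretation concrete, here is a sketch contrasting the two intervals on the same sample (normal data assumed; the parameters reuse the mean-100, SD-15 population from the script above):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=100, scale=15, size=30)

n = len(sample)
x_bar, s = sample.mean(), sample.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)

# Confidence interval: where is the population MEAN?
ci_half = t_crit * s / np.sqrt(n)
# Prediction interval: where will a single NEW observation land?
pi_half = t_crit * s * np.sqrt(1 + 1 / n)

print(f"95% CI for the mean:     [{x_bar - ci_half:.1f}, {x_bar + ci_half:.1f}]")
print(f"95% prediction interval: [{x_bar - pi_half:.1f}, {x_bar + pi_half:.1f}]")
# The prediction interval is roughly sqrt(n) times wider: it has to cover
# individual-level noise, not just uncertainty about the mean.

The confidence interval says where the mean plausibly is; the prediction interval says where the next data point plausibly falls. Confusing the two leads to wildly overconfident claims about individual observations.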