
Variance is risk. Standard deviation is the language of risk.

The underlying idea

Imagine two investment funds. Both averaged 8% annual returns over the last decade. Fund A returned between 6% and 10% every year. Fund B swung between -15% and +30%. Same average. Completely different experience. The difference is variance: how spread out the values are around the average.

Variance quantifies the “spread” in a dataset. When people say a measurement is “noisy,” they mean it has high variance. When they say an estimate is “precise,” they mean it has low variance. When a financial analyst says an asset is “risky,” they mean its returns have high variance.

Three closely related numbers show up constantly: variance, standard deviation, and standard error. They measure different things and are regularly confused with each other; understanding the distinctions is essential for every statistical method that follows.

Variance measures how far individual observations fall from the mean. Standard deviation is the square root of variance, putting the spread back into the original units. Standard error measures how far a sample statistic (like the sample mean) falls from the true population value. Variance and standard deviation describe your data. Standard error describes the precision of your estimate.

Historical root

The concept of variance traces back to Carl Friedrich Gauss and Adrien-Marie Legendre in the early 1800s. Both worked on the problem of combining astronomical measurements to get the best estimate of a planet’s orbit. They independently developed the method of least squares, which minimizes the sum of squared deviations from a fitted line. The “squared deviations” part is the direct ancestor of variance.

Ronald Fisher formalized the term “variance” in his 1918 paper “The Correlation between Relatives on the Supposition of Mendelian Inheritance.” Fisher was studying how traits varied across generations of organisms, and he needed a word for the quantity that measured this variation. Before Fisher, statisticians used phrases like “mean square deviation.” Fisher’s single word stuck.

The distinction between dividing by n versus n - 1 when computing sample variance also comes from Fisher. He showed in the 1920s that dividing by n systematically underestimates the true population variance. Dividing by n - 1 corrects this bias. The correction matters most with small samples and becomes negligible with large ones.

Key assumptions

Meaningful mean. Variance measures spread around the mean. If the mean isn’t a meaningful summary of your data (for example, with bimodal distributions or heavy-tailed distributions), variance alone can be misleading. A dataset with two distinct clusters might have the same variance as a dataset with one cluster, but the interpretation is completely different.

Finite variance. Not all distributions have finite variance. The Cauchy distribution (which appears in physics and some financial models) has undefined variance. If your data follows a distribution with infinite or undefined variance, sample variance doesn’t converge to a stable number as you collect more data. It keeps jumping around.
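A quick way to see this non-convergence (a sketch of my own, not from the original analysis): track the sample variance of Cauchy draws as the sample grows, and contrast it with normal draws, whose variance settles near the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample variance of Cauchy draws never settles down as n grows,
# because the Cauchy distribution has no finite variance.
draws = rng.standard_cauchy(100_000)
for n in (100, 1_000, 10_000, 100_000):
    print(f"n={n:>7,}  Cauchy sample variance = {draws[:n].var(ddof=1):,.1f}")

# Contrast with a standard normal sample, whose variance converges to 1.
normal = rng.standard_normal(100_000)
print(f"normal, n=100,000: sample variance = {normal.var(ddof=1):.3f}")
```

Each run (or seed) produces wildly different Cauchy variances, which is exactly the point: there is no stable number for the estimate to converge to.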

Scale sensitivity. Variance depends on the units of measurement. Income measured in dollars has a different variance than income measured in thousands of dollars. This means you can’t directly compare the variance of two variables measured on different scales without standardizing first.
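The scaling rule is exact: rescaling data by a constant c multiplies the variance by c² and the standard deviation by c. A minimal sketch with made-up income figures:

```python
import numpy as np

rng = np.random.default_rng(1)
income_dollars = rng.normal(60_000, 12_000, size=1_000)
income_thousands = income_dollars / 1_000  # same data, different units

# Dividing by 1000 divides the variance by 1000^2 but the SD by only 1000.
print(income_dollars.var(ddof=1) / income_thousands.var(ddof=1))  # ratio ≈ 1,000,000
print(income_dollars.std(ddof=1) / income_thousands.std(ddof=1))  # ratio ≈ 1,000
```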

Outlier sensitivity. Because variance involves squared deviations, a single extreme value can inflate it dramatically. Ten data points clustered tightly around the mean, plus one outlier far away, will have a variance dominated by that one outlier. The median absolute deviation (MAD) is a more robust alternative when outliers are a concern.
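A short sketch of that scenario (the numbers are invented for illustration): ten tight points plus one outlier. The variance explodes; the MAD barely moves.

```python
import numpy as np

# Ten points clustered around 10, then the same data plus one outlier at 50.
tight = np.array([9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 9.9, 10.0, 10.1])
with_outlier = np.append(tight, 50.0)

def mad(x):
    """Median absolute deviation from the median."""
    return np.median(np.abs(x - np.median(x)))

print(f"variance without outlier: {tight.var(ddof=1):.3f}")         # small
print(f"variance with outlier:    {with_outlier.var(ddof=1):.3f}")  # huge
print(f"MAD without outlier:      {mad(tight):.3f}")                # 0.100
print(f"MAD with outlier:         {mad(with_outlier):.3f}")         # still 0.100
```

The MAD is identical in both cases, while the variance is inflated by several orders of magnitude by the single outlier.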

The math

Population variance for a variable X with mean μ:

\sigma^2 = E[(X - \mu)^2] = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2

This is the average squared distance from the mean. Squaring serves two purposes: it makes all deviations positive (so values above and below the mean don’t cancel out), and it penalizes large deviations more than small ones.
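A worked check of the formula on toy numbers (my own example, not from the post), treating the five values as the entire population:

```python
import numpy as np

# Treat these five values as the whole population (divide by N, not N - 1).
x = np.array([2.0, 4.0, 4.0, 4.0, 6.0])
mu = x.mean()                    # 4.0
sq_dev = (x - mu) ** 2           # [4, 0, 0, 0, 4]
print(sq_dev.sum() / len(x))     # 8 / 5 = 1.6
print(x.var(ddof=0))             # numpy agrees: 1.6
```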

Sample variance uses n - 1 in the denominator instead of n:

s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2

The n - 1 is called Bessel’s correction. The intuition: when you compute deviations from the sample mean x̄ instead of the true mean μ, the deviations are systematically smaller because x̄ is chosen to minimize them. Dividing by n - 1 compensates for this, producing an unbiased estimate of σ².

Standard deviation is the square root of variance:

\sigma = \sqrt{\sigma^2} \qquad \text{(population)} \qquad s = \sqrt{s^2} \qquad \text{(sample)}

Standard deviation is in the same units as the original data. If your data is in dollars, variance is in “dollars squared” (which has no intuitive meaning), but standard deviation is back in dollars.

Standard error of the mean connects standard deviation to sampling precision:

\text{SE}(\bar{X}) = \frac{s}{\sqrt{n}}

This is the standard deviation of the sampling distribution of the mean, not the standard deviation of the data itself. It tells you how much your sample mean would vary if you repeated the sampling process many times. This formula was central to the sampling post (A3) and shows up in every confidence interval and hypothesis test.

Coefficient of variation (CV) allows comparison across different scales:

\text{CV} = \frac{s}{\bar{x}} \times 100\%

A CV of 10% means the standard deviation is 10% of the mean, regardless of the units. This lets you compare the relative variability of two measurements on completely different scales.
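A minimal sketch of that comparison (the two variables and their parameters are invented): body weight in kilograms and reaction time in milliseconds have very different SDs, but comparable relative variability.

```python
import numpy as np

rng = np.random.default_rng(7)
weight_kg = rng.normal(70, 7, size=500)      # mean 70 kg, sd 7 kg
reaction_ms = rng.normal(250, 25, size=500)  # mean 250 ms, sd 25 ms

def cv(x):
    """Coefficient of variation: SD as a percentage of the mean."""
    return x.std(ddof=1) / x.mean() * 100

# The raw SDs (7 kg vs 25 ms) aren't comparable, but both CVs are ~10%.
print(f"weight:   sd = {weight_kg.std(ddof=1):.1f} kg, CV = {cv(weight_kg):.1f}%")
print(f"reaction: sd = {reaction_ms.std(ddof=1):.1f} ms, CV = {cv(reaction_ms):.1f}%")
```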

The code

[Figure: three panels — same mean with different variance; standard deviation vs. standard error; Bessel’s correction]

This script demonstrates the difference between variance, standard deviation, and standard error, and shows why high variance isn’t always bad.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# --- Panel 1: Same mean, different variance ---
fund_a = rng.normal(loc=8, scale=2, size=100)    # low volatility
fund_b = rng.normal(loc=8, scale=12, size=100)   # high volatility

axes[0].hist(fund_a, bins=20, alpha=0.6, label=f"Fund A (sd={fund_a.std():.1f})", color="steelblue")
axes[0].hist(fund_b, bins=20, alpha=0.6, label=f"Fund B (sd={fund_b.std():.1f})", color="coral")
axes[0].axvline(x=8, color="black", linestyle="--", label="Mean = 8%")
axes[0].set_xlabel("Annual return (%)")
axes[0].set_title("Same mean, different risk")
axes[0].legend(fontsize=8)

# --- Panel 2: Standard deviation vs standard error ---
population = rng.normal(loc=50, scale=15, size=100_000)
sample_sizes = [10, 25, 50, 100, 250, 500, 1000]
sds = []
ses = []

for n in sample_sizes:
    sample = rng.choice(population, size=n)
    sds.append(sample.std(ddof=1))
    ses.append(sample.std(ddof=1) / np.sqrt(n))

axes[1].plot(sample_sizes, sds, 'o-', color="steelblue", label="Std Dev (s)")
axes[1].plot(sample_sizes, ses, 's-', color="coral", label="Std Error (SE)")
axes[1].set_xlabel("Sample size")
axes[1].set_ylabel("Value")
axes[1].set_title("SD stays flat, SE shrinks with n")
axes[1].legend(fontsize=8)
axes[1].set_xscale("log")

# --- Panel 3: Bessel's correction matters for small n ---
true_var = 15**2  # population variance = 225
sample_sizes_small = [3, 5, 10, 20, 50, 100, 500]
bias_n = []     # dividing by n
bias_n1 = []    # dividing by n-1

for n in sample_sizes_small:
    vars_n = []
    vars_n1 = []
    for _ in range(5000):
        sample = rng.normal(loc=50, scale=15, size=n)
        vars_n.append(sample.var(ddof=0))    # divide by n
        vars_n1.append(sample.var(ddof=1))   # divide by n-1
    bias_n.append(np.mean(vars_n) - true_var)
    bias_n1.append(np.mean(vars_n1) - true_var)

axes[2].plot(sample_sizes_small, bias_n, 'o-', color="coral", label="Divide by n (biased)")
axes[2].plot(sample_sizes_small, bias_n1, 's-', color="seagreen", label="Divide by n-1 (unbiased)")
axes[2].axhline(y=0, color="black", linestyle="--")
axes[2].set_xlabel("Sample size")
axes[2].set_ylabel("Average bias")
axes[2].set_title("Bessel's correction matters for small n")
axes[2].legend(fontsize=8)

plt.tight_layout()
plt.savefig("variance_demo.png", dpi=150)
plt.show()

# Print the key distinction
sample = rng.normal(loc=50, scale=15, size=100)
print(f"Sample std dev (s):    {sample.std(ddof=1):.2f}  -- describes the data")
print(f"Standard error (SE):   {sample.std(ddof=1)/np.sqrt(100):.2f}  -- describes estimate precision")

The left panel shows two investments with the same average return but vastly different standard deviations. Fund A is a steady performer. Fund B is a roller coaster. Same mean, completely different risk profile.

The middle panel shows the critical distinction between standard deviation and standard error. As sample size grows, standard deviation stays roughly constant (it’s a property of the data, not the sample size). Standard error shrinks (it’s a property of the estimate, and it improves with more data). Confusing these two is one of the most common mistakes in applied statistics.

The right panel demonstrates Bessel’s correction. Dividing by n systematically underestimates the true variance (negative bias), especially with small samples. Dividing by n - 1 eliminates this bias. By n = 50, the difference is small. By n = 500, it’s negligible.

Business application

Portfolio risk management. In finance, standard deviation of returns is the standard measure of risk. A fund with 20% annual standard deviation is roughly twice as volatile as one with 10%. But variance isn’t always bad. A startup stock with high variance might produce spectacular returns or total losses. A government bond with low variance produces steady but modest returns. The right choice depends on your risk tolerance and time horizon, not on which number is smaller.

Manufacturing quality control. In manufacturing, low variance means consistent output. If a machine is supposed to cut parts to 10.0 cm, a standard deviation of 0.01 cm means almost every part is within spec. A standard deviation of 0.5 cm means many parts will be rejected. Control charts track variance over time: a sudden increase in variance signals that something in the process has changed, even if the mean hasn’t moved.

A/B test sample size calculations. When planning an A/B test, you need the expected variance of the metric to determine how many users you need. Higher variance means you need more data to detect a given effect size. If your revenue-per-user has a standard deviation of $50 and you want to detect a $2 difference, you’ll need a much larger sample than if the standard deviation were $5. The standard error formula (s/√n) is the direct link between variance, sample size, and statistical power.
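That link can be sketched with the standard two-sample formula for comparing means (the function name and defaults here are illustrative; real planning tools handle one- vs. two-sided tests and unequal variances differently):

```python
from statistics import NormalDist

def n_per_group(sd, min_effect, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sample comparison of means:
    n = 2 * (z_{alpha/2} + z_power)^2 * sd^2 / effect^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    return 2 * (z_alpha + z_power) ** 2 * sd ** 2 / min_effect ** 2

# Detecting a $2 lift: because n scales with sd^2, a 10x larger SD
# means a 100x larger sample.
print(round(n_per_group(sd=50, min_effect=2)))  # 9811 per group
print(round(n_per_group(sd=5, min_effect=2)))   # 98 per group
```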

When high variance is informative, not problematic. In customer segmentation, high variance in a metric can signal that you’re looking at a mixture of distinct groups. If average session duration has a standard deviation nearly as large as the mean, your “average user” doesn’t exist. There are probably short-session users and long-session users, and analyzing them as a single group hides the real patterns. High variance is a signal to stratify, not a problem to solve.


Pius Oyedepo

Statistician and data analyst. Writing about the math behind the models.