Aleatoric and Epistemic Uncertainty

This section builds on the empirical and expected risk page. There the expected test error at a point was decomposed into bias, variance, and an irreducible noise term:

R(\boldsymbol{\theta}) \;=\; \underbrace{\mathrm{Bias}^2 + \mathrm{Var}}_{\text{epistemic}} \;+\; \underbrace{\sigma^2}_{\text{aleatoric}}

These two groups are the two fundamental sources of predictive uncertainty:

Aleatoric uncertainty ( $\sigma^2$ ), also called statistical or data uncertainty, is the irreducible noise in the data-generating process itself: sensor noise, measurement error, intrinsic randomness, label ambiguity. It is a property of the world, not of your model, so it cannot be reduced by collecting more data or choosing a better hypothesis. In object detection, for instance, bounding boxes drawn by different annotators vary slightly, and motion blur makes object boundaries genuinely ambiguous. Even with infinite training data this floor on the achievable error remains.
Epistemic uncertainty ( $\mathrm{Bias}^2 + \mathrm{Var}$ ), also called model or knowledge uncertainty, is uncertainty about the model: the parameters $\boldsymbol{\theta}$ you inferred from a finite training set, the regions of input space you never observed, the capacity and assumptions you imposed. It is reducible: it shrinks as the training set grows, as you improve the model, and as you tune capacity and regularization toward the optimum (the $\lambda$ optimization from the regression section). A model trained only on adult scans, for example, is highly epistemically uncertain on pediatric ones, a gap that more representative data closes.

Below you make this split concrete with the synthetic sinusoid and the regularized ridge model from the linear regression section.

Two ways to write the total uncertainty

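The bias-variance form above comes from the squared-error decomposition. A closely related, model-centric statement is the law of total variance, which is the form you meet in Bayesian neural networks, Gaussian processes, and deep ensembles. It describes the spread of the predictive distribution rather than the error against the truth, so it carries no separate bias term: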

\underbrace{\mathrm{Var}(y \mid x)}_{\text{total}} \;=\; \underbrace{\mathbb{E}_{\boldsymbol{\theta}}\!\big[\mathrm{Var}(y \mid x, \boldsymbol{\theta})\big]}_{\text{aleatoric}} \;+\; \underbrace{\mathrm{Var}_{\boldsymbol{\theta}}\!\big[\mathbb{E}(y \mid x, \boldsymbol{\theta})\big]}_{\text{epistemic}}.

The first term averages the noise the model expects given its parameters (the aleatoric floor); the second measures how much the model's mean prediction moves as the parameters change across plausible fits (the epistemic spread). This is exactly the quantity that connects to the learning problem: the gap $p_{\text{model}}(\mathbf{x}) \neq p_{\text{data}}(\mathbf{x})$ is largely epistemic, and training reduces it by minimizing $D_{\mathrm{KL}}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}})$. Even once $p_{\text{model}} = p_{\text{data}}$, the aleatoric term survives, because the world itself is noisy.
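The law of total variance can be checked numerically on a toy case. The sketch below is not the sinusoid experiment from this page: it assumes a hypothetical one-parameter model whose mean prediction `theta` is Gaussian across plausible fits (spread `TAU`, an illustrative value) and whose noise given `theta` has standard deviation `SIGMA`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

TAU = 0.5     # assumed spread of the mean prediction across fits (epistemic)
SIGMA = 0.25  # noise standard deviation given the parameters (aleatoric)
n = 200_000

theta = rng.normal(0.0, TAU, n)        # E[y | x, theta] = theta for each plausible fit
y = theta + rng.normal(0.0, SIGMA, n)  # Var(y | x, theta) = SIGMA**2 for every theta

total = y.var()          # Var(y | x), marginal over theta
aleatoric = SIGMA**2     # E_theta[Var(y | x, theta)], constant in this toy case
epistemic = theta.var()  # Var_theta[E(y | x, theta)]

print(f"total = {total:.4f}, aleatoric + epistemic = {aleatoric + epistemic:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which is the identity above evaluated at a single input.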

The data-generating process

The targets are a deterministic signal corrupted by additive Gaussian noise,

y = f(x) + \varepsilon, \qquad f(x) = \sin(2\pi x), \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 0.25.

The noise $\varepsilon$ is the aleatoric uncertainty: even if we knew $f$ exactly, every observed $y$ would still scatter around it with variance $\sigma^2$. This $y = f(x) + \varepsilon$ is exactly the Gaussian likelihood from conditional MLE read as the data-generating truth, so $\sigma^2$ is the aleatoric term that a fixed-variance maximum-likelihood fit assumes, and the conditional mean $f(x)$ is the point prediction the model targets. The model used throughout is the degree-9 polynomial ridge regressor from the linear regression section, here with a small penalty $\lambda = 0.02$ chosen so the fit stays stable across sample sizes.

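import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # random generator; the seed value is an assumption

def f(x):
    return np.sin(2 * np.pi * x)

SIGMA = 0.25  # aleatoric noise standard deviation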

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(0, SIGMA, n)

# the regularized ridge model: degree-9 polynomial, standardized features, small lambda
def fit_ridge(x, y, M=9, lam=0.02):
    P = np.vander(x, M + 1, increasing=True)[:, 1:]
    mu, sd = P.mean(0), P.std(0) + 1e-12
    Z = (P - mu) / sd
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ (y - y.mean()))
    return lambda xq: ((np.vander(xq, M + 1, increasing=True)[:, 1:] - mu) / sd) @ theta + y.mean()
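To see why $\sigma^2$ acts as a floor, it may help to score the true function itself on noisy data. The sketch below is self-contained (it redefines `f` and `SIGMA` and picks its own seed, which is an assumption) and estimates the mean squared error of the perfect predictor $f(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
SIGMA = 0.25                    # same noise level as in the text

def f(x):
    return np.sin(2 * np.pi * x)

# Draw many noisy observations from y = f(x) + eps.
x = rng.uniform(0, 1, 100_000)
y = f(x) + rng.normal(0, SIGMA, x.size)

# Even the true function cannot predict the noise: its MSE is the noise variance.
oracle_mse = np.mean((y - f(x)) ** 2)
print(f"oracle MSE = {oracle_mse:.4f}, sigma^2 = {SIGMA**2:.4f}")
```

No fitted model can be expected to beat this value on average, since any model error adds to it.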

Estimating epistemic uncertainty by retraining

Epistemic uncertainty lives in the randomness of $\boldsymbol{\theta}(\hat{p}_{\text{data}})$: a different training set yields a different fitted model. We approximate the distribution of the predictor by drawing many independent training sets, fitting the same ridge model to each, and looking at how the predictions spread. Each thin curve below is one trained model; their spread at a given $x$ is the epistemic uncertainty there.

x_grid = np.linspace(0, 1, 200)
T, N = 500, 30  # 500 training sets of 30 points each

preds = np.array([fit_ridge(*make_data(N))(x_grid) for _ in range(T)])
mean_pred = preds.mean(axis=0)

plt.figure(figsize=[10, 8])
for p in preds[:40]:
    plt.plot(x_grid, p, color="C0", alpha=0.12)
plt.plot(x_grid, f(x_grid), "-g", lw=2, label="true $f(x)$")
plt.plot(x_grid, mean_pred, "-r", lw=2, label="mean prediction")
plt.ylim(-2, 2)
plt.xlabel("$x$"); plt.ylabel("$y$"); plt.legend()
plt.show()

Predictive uncertainty bands

Stacking the two sources gives the total predictive uncertainty. At each $x$ the variance of a future observation is

\mathrm{Var}[\,y \mid x\,] \;\approx\; \underbrace{\mathrm{Var}_{\boldsymbol{\theta}}[\hat{y}]}_{\text{epistemic}} \;+\; \underbrace{\sigma^2}_{\text{aleatoric}}.

The inner band is epistemic only (model spread); the outer band adds the aleatoric noise floor. The epistemic band widens where the data constrain the fit least, near the boundaries of $[0, 1]$, where training points lie on only one side of the query. The aleatoric contribution is constant everywhere, a fixed property of the noise.

epistemic_std = preds.std(axis=0)                  # model spread across training sets
total_std = np.sqrt(epistemic_std**2 + SIGMA**2)   # add the aleatoric noise

plt.figure(figsize=[10, 8])
plt.fill_between(x_grid, mean_pred - 2*total_std, mean_pred + 2*total_std,
                 color="C1", alpha=0.25, label="epistemic + aleatoric")
plt.fill_between(x_grid, mean_pred - 2*epistemic_std, mean_pred + 2*epistemic_std,
                 color="C0", alpha=0.40, label="epistemic (model)")
plt.plot(x_grid, f(x_grid), "--g", lw=2, label="true $f(x)$")
plt.plot(x_grid, mean_pred, "-r", lw=2, label="mean prediction")
plt.ylim(-2, 2)
plt.xlabel("$x$"); plt.ylabel("$y$"); plt.legend()
plt.show()
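One way to sanity-check the outer band is to ask how often a fresh observation falls within two total standard deviations of a freshly trained model's prediction. The sketch below is a rough check under assumptions not stated on this page: it treats the residual as approximately Gaussian and the bias as small, and it uses its own seed and a hypothetical number of trials `K`.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
SIGMA = 0.25

def f(x):
    return np.sin(2 * np.pi * x)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(0, SIGMA, n)

def fit_ridge(x, y, M=9, lam=0.02):
    P = np.vander(x, M + 1, increasing=True)[:, 1:]
    mu, sd = P.mean(0), P.std(0) + 1e-12
    Z = (P - mu) / sd
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ (y - y.mean()))
    return lambda xq: ((np.vander(xq, M + 1, increasing=True)[:, 1:] - mu) / sd) @ theta + y.mean()

# Build the total band from retraining, as in the text.
x_grid = np.linspace(0, 1, 200)
T, N = 500, 30
preds = np.array([fit_ridge(*make_data(N))(x_grid) for _ in range(T)])
total_std = np.sqrt(preds.var(axis=0) + SIGMA**2)

# Fresh inputs, fresh observations, and one freshly trained model per trial.
K = 2000  # hypothetical number of trials
x_new = rng.uniform(0, 1, K)
y_new = f(x_new) + rng.normal(0, SIGMA, K)
y_hat = np.array([fit_ridge(*make_data(N))(x_new[k:k + 1])[0] for k in range(K)])

# Fraction of observations within two total standard deviations of the prediction.
half_width = 2 * np.interp(x_new, x_grid, total_std)
coverage = np.mean(np.abs(y_new - y_hat) <= half_width)
print(f"empirical coverage of the 2-sigma band: {coverage:.3f}")
```

If the assumptions hold, the coverage should land near the nominal 95% of a Gaussian two-sigma interval. A clearly lower value would suggest that bias or heavy tails matter.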

The pointwise error breakdown

Averaged over training sets, the expected squared error at each $x$ is the sum of three nonnegative pieces,

\mathbb{E}[(\hat{y} - y)^2 \mid x] = \underbrace{\mathrm{Bias}(x)^2 + \mathrm{Var}(x)}_{\text{epistemic}} + \underbrace{\sigma^2}_{\text{aleatoric}}.

The stacked plot shows their relative sizes across the input range. The grey aleatoric floor is flat; the bias$^2$ and variance terms are the epistemic part, the only error the model can actually influence.

bias2 = (mean_pred - f(x_grid))**2
epistemic_var = epistemic_std**2
aleatoric_var = np.full_like(x_grid, SIGMA**2)

plt.figure(figsize=[10, 8])
plt.stackplot(x_grid, aleatoric_var, epistemic_var, bias2,
              labels=["aleatoric $\\sigma^2$", "epistemic variance", "bias$^2$"],
              colors=["#9e9e9e", "C0", "C3"], alpha=0.85)
plt.xlabel("$x$"); plt.ylabel("expected squared error")
plt.legend(loc="upper center")
plt.show()
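The decomposition can also be checked by brute force at one input. The sketch below picks a hypothetical query point `x0 = 0.3`, retrains the same ridge model on many independent training sets, and compares the directly measured squared error against the sum of the three estimated terms. The seed and the number of repetitions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
SIGMA = 0.25

def f(x):
    return np.sin(2 * np.pi * x)

def fit_ridge(x, y, M=9, lam=0.02):
    P = np.vander(x, M + 1, increasing=True)[:, 1:]
    mu, sd = P.mean(0), P.std(0) + 1e-12
    Z = (P - mu) / sd
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ (y - y.mean()))
    return lambda xq: ((np.vander(xq, M + 1, increasing=True)[:, 1:] - mu) / sd) @ theta + y.mean()

x0 = np.array([0.3])  # hypothetical query point
T, N = 4000, 30       # number of retrainings and training set size

# Prediction at x0 from T independently trained models.
y_hat = np.empty(T)
for t in range(T):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, SIGMA, N)
    y_hat[t] = fit_ridge(x, y)(x0)[0]

# Independent noisy test observations at x0.
y_new = f(x0[0]) + rng.normal(0, SIGMA, T)

mse = np.mean((y_hat - y_new) ** 2)      # left-hand side, measured directly
bias2 = (y_hat.mean() - f(x0[0])) ** 2   # squared bias at x0
var = y_hat.var()                        # epistemic variance at x0
print(f"MSE = {mse:.4f}, bias^2 + var + sigma^2 = {bias2 + var + SIGMA**2:.4f}")
```

The two sides should match up to Monte Carlo error; the gap comes only from the finite sample of test noise.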

Aleatoric is irreducible; epistemic vanishes with data

The decisive difference between the two is what happens as the training set grows. When the experiment is re-run for increasing $N$, the epistemic variance falls steadily toward zero (more data pins down $\boldsymbol{\theta}$), while the aleatoric floor $\sigma^2$ stays put. No amount of data removes it.

Ns = [20, 30, 50, 100, 200, 400]
mean_epistemic, mean_bias2 = [], []
for n in Ns:
    P = np.array([fit_ridge(*make_data(n))(x_grid) for _ in range(300)])
    mean_epistemic.append(P.var(axis=0).mean())
    mean_bias2.append(((P.mean(axis=0) - f(x_grid))**2).mean())

plt.figure(figsize=[10, 8])
plt.plot(Ns, mean_epistemic, "-o", label="epistemic (mean variance)")
plt.plot(Ns, mean_bias2, "-o", label="bias$^2$")
plt.axhline(SIGMA**2, color="#757575", ls="--", label="aleatoric $\\sigma^2$ (irreducible)")
plt.xscale("log"); plt.yscale("log")
plt.xlabel("training set size $N$"); plt.ylabel("error contribution (log scale)")
plt.legend()
plt.show()
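The rate of decay can be summarized by fitting a straight line to the log-log curve. The sketch below repeats the experiment in self-contained form and estimates the slope of the mean epistemic variance against $N$. A slope near $-1$ would correspond to variance shrinking like $1/N$, although the ridge penalty and the wider spread near the boundaries can shift it, so no particular value is claimed here. The seed is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
SIGMA = 0.25

def f(x):
    return np.sin(2 * np.pi * x)

def fit_ridge(x, y, M=9, lam=0.02):
    P = np.vander(x, M + 1, increasing=True)[:, 1:]
    mu, sd = P.mean(0), P.std(0) + 1e-12
    Z = (P - mu) / sd
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ (y - y.mean()))
    return lambda xq: ((np.vander(xq, M + 1, increasing=True)[:, 1:] - mu) / sd) @ theta + y.mean()

x_grid = np.linspace(0, 1, 200)
Ns = np.array([20, 30, 50, 100, 200, 400])

mean_var = []
for n in Ns:
    preds = []
    for _ in range(300):
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, SIGMA, n)
        preds.append(fit_ridge(x, y)(x_grid))
    # Variance across training sets, averaged over the input grid.
    mean_var.append(np.var(preds, axis=0).mean())

# Least-squares line through (log N, log variance); the slope is the decay exponent.
slope, intercept = np.polyfit(np.log(Ns), np.log(mean_var), 1)
print(f"log-log slope of epistemic variance vs N: {slope:.2f}")
```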

Connecting back to risk

The expected risk $R(\boldsymbol{\theta})$ from the empirical-risk page is just the input-averaged total error of these plots. The aleatoric $\sigma^2$ is the Bayes error, the best risk achievable by any model, and reaching it is the goal. Everything above it is epistemic and therefore reducible: more data shrinks the variance, and choosing capacity / regularization optimally (the $\lambda^\ast$ from the regression section) balances bias against variance. Empirical risk minimization only ever sees one training set, so it cannot observe the epistemic variance directly, which is why we estimate it by resampling, exactly as the empirical-risk page notes.
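In practice fresh training sets are rarely available. A common substitute is the nonparametric bootstrap, which resamples the single observed training set with replacement. This is a different technique from the fresh-sample retraining used on this page, and it only approximates the variance part of the epistemic term, not the bias. A minimal sketch, with an assumed seed and a hypothetical number of replicates `B`:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
SIGMA = 0.25

def f(x):
    return np.sin(2 * np.pi * x)

def fit_ridge(x, y, M=9, lam=0.02):
    P = np.vander(x, M + 1, increasing=True)[:, 1:]
    mu, sd = P.mean(0), P.std(0) + 1e-12
    Z = (P - mu) / sd
    theta = np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ (y - y.mean()))
    return lambda xq: ((np.vander(xq, M + 1, increasing=True)[:, 1:] - mu) / sd) @ theta + y.mean()

# The one training set we actually get to see.
N = 30
x = rng.uniform(0, 1, N)
y = f(x) + rng.normal(0, SIGMA, N)

x_grid = np.linspace(0, 1, 200)
B = 200  # hypothetical number of bootstrap replicates
boot = np.empty((B, x_grid.size))
for b in range(B):
    idx = rng.integers(0, N, N)  # sample N indices with replacement
    boot[b] = fit_ridge(x[idx], y[idx])(x_grid)

# Spread across bootstrap refits, used as a stand-in for the epistemic std.
boot_std = boot.std(axis=0)
print(f"bootstrap epistemic std: min {boot_std.min():.3f}, max {boot_std.max():.3f}")
```

The resulting `boot_std` plays the role of `epistemic_std` above, but it is estimated from one dataset and so inherits that dataset's quirks.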
