The discussion of the marginal distribution applies equally to the conditional distribution $p_{model}(\mathbf y \mid \mathbf x; \mathbf w)$ that governs supervised learning, where $\mathbf y$ denotes the label / target variable. The loss is the cross entropy (CE) between $\hat p_{data}$ and the model:

$$L(\mathbf w) = - \mathbb{E}_{\mathbf x, \mathbf y \sim \hat p_{data}} \log p_{model}(\mathbf y \mid \mathbf x; \mathbf w)$$

The attractiveness of the maximum likelihood (ML) solution is that the CE (also known as log-loss) is general and we don’t need to re-design it when we change the model. This generality is why the major machine learning software frameworks ship ready-made APIs for computing the CE.
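To make the expectation concrete, here is a minimal NumPy sketch of the empirical CE for a classifier: the average negative log-probability the model assigns to the observed labels. The function name `empirical_cross_entropy` and the toy probabilities are illustrative assumptions, not part of any framework API.

```python
import numpy as np

def empirical_cross_entropy(probs, labels):
    """Average of -log p_model(y_i | x_i; w) over the training samples.

    probs:  (N, K) array; row i holds the model's class probabilities for x_i.
    labels: (N,) integer array with the observed class of each sample.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Pick the probability the model assigned to the class that was observed.
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked))

# Hypothetical model outputs for three samples and three classes.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.25, 0.25, 0.50]])
labels = np.array([0, 1, 2])

loss = empirical_cross_entropy(probs, labels)
print(f"empirical cross entropy: {loss:.4f}")
```

Swapping in a different model changes only how `probs` is produced; the loss function itself stays the same.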

Visualizing the regression function - the conditional mean

It is now instructive to work through an example showing that even the plain-old mean squared error (MSE), the usual objective in the regression setting, falls under the same umbrella: it is the cross entropy between $\hat p_{data}$ and a Gaussian model. Please follow the discussion in Section 5.5.1 of Ian Goodfellow’s Deep Learning book or Section 20.2.4 of Russell & Norvig’s book, and use the following figure to visualize the relationship between $p_{data}$ and $p_{model}$.

[Figure: A regression curve f(x; w) with a Gaussian p_model centered on it at each input x; the green dashed line traces the conditional mean of the Gaussian model.]

The green dashed line shows the mean of the $p_{model}$ distribution. When reading the figure, replace the y-axis target variable $t$ with $y$.

Key Insight: MSE as Cross-Entropy

When we assume the model distribution is Gaussian:

$$p_{model}(\mathbf y \mid \mathbf x; \mathbf w) = \mathcal{N}(\mathbf y \mid f(\mathbf x; \mathbf w), \sigma^2)$$

The negative log-likelihood becomes:

$$-\log p_{model}(\mathbf y \mid \mathbf x; \mathbf w) = \frac{1}{2\sigma^2}(\mathbf y - f(\mathbf x; \mathbf w))^2 + \text{const}$$

Averaged over the training data, the first term is the mean squared error (MSE) scaled by $\frac{1}{2\sigma^2}$, and the constant does not depend on $\mathbf w$. Therefore, minimizing MSE is equivalent to maximum likelihood estimation under a Gaussian noise assumption.
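The equivalence can be checked numerically. The sketch below assumes a toy one-parameter model $f(x; w) = w x$, a known noise level `sigma`, and synthetic data. All of these are illustrative choices rather than anything prescribed by the text. It evaluates both objectives over a grid of slopes and confirms that they differ only by a scale and an offset.

```python
import numpy as np

def gaussian_nll(y, mean, sigma):
    """Average negative log-density of y under N(mean, sigma^2)."""
    y = np.asarray(y, dtype=float)
    mean = np.asarray(mean, dtype=float)
    log_norm = 0.5 * np.log(2.0 * np.pi * sigma**2)  # the additive "const"
    return np.mean(log_norm + (y - mean) ** 2 / (2.0 * sigma**2))

def mse(y, mean):
    """Mean squared error between targets and predicted means."""
    return np.mean((np.asarray(y, dtype=float) - np.asarray(mean, dtype=float)) ** 2)

# Synthetic data from a line with Gaussian noise (assumed for illustration).
rng = np.random.default_rng(0)
sigma = 0.3
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(scale=sigma, size=200)

# Evaluate both objectives for candidate slopes w in f(x; w) = w * x.
slopes = np.linspace(0.0, 4.0, 81)
nll_values = np.array([gaussian_nll(y, w * x, sigma) for w in slopes])
mse_values = np.array([mse(y, w * x) for w in slopes])

const = 0.5 * np.log(2.0 * np.pi * sigma**2)
print("NLL == MSE / (2 sigma^2) + const:",
      np.allclose(nll_values, mse_values / (2.0 * sigma**2) + const))
print("same minimizer:", slopes[np.argmin(nll_values)] == slopes[np.argmin(mse_values)])
```

Because the map from MSE to NLL is affine with a positive slope, any optimizer that lowers one lowers the other.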

What the point estimate leaves out

Maximum likelihood returns a single $\hat{\mathbf w}$, so it gives you one conditional-mean curve $f(\mathbf x; \hat{\mathbf w})$ together with the fixed noise variance $\sigma^2$. Each relates to a different source of predictive uncertainty:
  • The $\sigma^2$ in the Gaussian is the aleatoric uncertainty. It is irreducible by construction: the model assumes the noise is constant, so it appears as the $\frac{1}{2\sigma^2}$ scaling and the additive const, never as something optimization can drive to zero.
  • What a point estimate cannot show is how much the curve $f(\mathbf x; \hat{\mathbf w})$ itself would shift if you refit on a different training sample. That spread is the epistemic uncertainty, and a single $\hat{\mathbf w}$ is silent about it.
The aleatoric and epistemic uncertainty page picks up exactly here. It treats this Gaussian as the data-generating truth and decomposes the total predictive variance into the $\sigma^2$ floor plus the spread of the fitted model across resampled training sets, the term a point-estimate MLE discards.
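The decomposition can be previewed with a small experiment. The sketch below is an illustration under stated assumptions, not the method of the linked page: it uses a linear model fitted with `np.polyfit`, treats `sigma` as known, and approximates "refitting on a different training sample" with bootstrap resamples of one synthetic dataset. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set from a line with constant Gaussian noise (assumed).
n = 50
sigma = 0.3
x = rng.uniform(-1.0, 1.0, size=n)
y = 2.0 * x + rng.normal(scale=sigma, size=n)

# Refit the same model on bootstrap resamples and record each fitted curve.
x_grid = np.linspace(-1.0, 1.0, 5)
n_boot = 500
curves = np.empty((n_boot, x_grid.size))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)  # sample indices with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    curves[b] = slope * x_grid + intercept

# Spread of the fitted curve across refits: the epistemic term.
epistemic_var = curves.var(axis=0)
# Constant noise floor of the Gaussian model: the aleatoric term.
aleatoric_var = sigma**2
total_var = aleatoric_var + epistemic_var

for xg, ev, tv in zip(x_grid, epistemic_var, total_var):
    print(f"x={xg:+.1f}  epistemic={ev:.4f}  aleatoric={aleatoric_var:.4f}  total={tv:.4f}")
```

A single maximum likelihood fit corresponds to one row of `curves`; the variance across rows is exactly the information that a point estimate does not carry.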

References