Maximum Likelihood Estimation of Conditional Models

The discussion of the marginal distribution applies equally to the conditional distribution

p_{model}(\mathbf y \mid \mathbf x; \mathbf w)

which governs supervised learning, where \mathbf y denotes the label (target) variable. The corresponding loss is the cross entropy (CE) between the empirical data distribution and the model, which most machine learning frameworks provide as a built-in loss function:

L(\mathbf w) = - \mathbb{E}_{\mathbf x, \mathbf y \sim \hat p_{data}} \log p_{model}(\mathbf y \mid \mathbf x; \mathbf w)

The appeal of the maximum likelihood (ML) approach is that the CE loss (also known as log-loss) is general: the same objective applies whatever form p_{model} takes, so we do not need to redesign it when we change the model.
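To make the generality of the loss concrete, here is a minimal NumPy sketch. It estimates L(\mathbf w) as an average over a training set and takes the model's log-density as an argument, so the loss code is untouched when the model changes. The function names (`conditional_nll`, `gaussian_log_prob`, `bernoulli_log_prob`), the linear form of the models, and the synthetic data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def conditional_nll(log_prob, X, Y, w):
    """Empirical L(w): average of -log p_model(y | x; w) over the training pairs."""
    return -np.mean([log_prob(y, x, w) for x, y in zip(X, Y)])

def gaussian_log_prob(y, x, w, sigma=1.0):
    """Assumed regression model: y ~ N(w.x, sigma^2)."""
    mean = w @ x
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mean) ** 2 / (2 * sigma**2)

def bernoulli_log_prob(y, x, w):
    """Assumed binary classifier: P(y=1 | x) = sigmoid(w.x), in a numerically stable form."""
    logit = w @ x
    # log sigmoid(z) = -log(1 + e^{-z});  log(1 - sigmoid(z)) = -log(1 + e^{z})
    return -np.logaddexp(0.0, -logit) if y == 1 else -np.logaddexp(0.0, logit)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
Y_reg = X @ w_true + rng.normal(scale=0.3, size=200)  # real-valued targets
Y_cls = (X @ w_true > 0).astype(int)                  # binary labels

w0 = np.zeros(3)
nll_reg = conditional_nll(gaussian_log_prob, X, Y_reg, w0)   # regression loss
nll_cls = conditional_nll(bernoulli_log_prob, X, Y_cls, w0)  # classification loss
print(nll_reg, nll_cls)
```

Only the `log_prob` argument differs between the regression and classification calls; the objective itself is the same expectation of the negative log-likelihood.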

Visualizing the regression function - the conditional mean

It is instructive to work through an example showing that even plain mean squared error (MSE), the usual objective in regression, falls under the same umbrella: it is the cross entropy between \hat p_{data} and a Gaussian model. Please follow the discussion in Section 5.5.1 of Ian Goodfellow’s Deep Learning book or Section 20.2.4 of Russell & Norvig’s book, and consider the following figure to help visualize the relationship between p_{data} and p_{model}.

Key Insight: MSE as Cross-Entropy

When we assume the model distribution is Gaussian:

p_{model}(\mathbf y \mid \mathbf x; \mathbf w) = \mathcal{N}(\mathbf y \mid f(\mathbf x; \mathbf w), \sigma^2)

The negative log-likelihood becomes:

-\log p_{model}(\mathbf y \mid \mathbf x; \mathbf w) = \frac{1}{2\sigma^2}(\mathbf y - f(\mathbf x; \mathbf w))^2 + \text{const}

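Averaged over the training set, this is the mean squared error (MSE) scaled by 1/(2\sigma^2), plus a constant that does not depend on \mathbf w (for fixed \sigma). Therefore, minimizing MSE is equivalent to maximum likelihood estimation under a Gaussian noise assumption.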
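The equivalence can be checked numerically. The sketch below assumes a scalar target, a linear model f(x; \mathbf w) = w_0 + w_1 x, and a fixed, known \sigma; the data and all names (`mse`, `gaussian_nll`, `w_other`) are hypothetical. It compares the average Gaussian negative log-likelihood with MSE / (2\sigma^2) plus the constant \frac{1}{2}\log(2\pi\sigma^2), and confirms that the least-squares solution also attains the lower Gaussian NLL.

```python
import numpy as np

# Synthetic 1-D regression data (illustrative assumption)
rng = np.random.default_rng(0)
N = 500
x = rng.uniform(-1.0, 1.0, size=N)
X = np.column_stack([np.ones(N), x])   # design matrix for f(x; w) = w0 + w1 * x
w_true = np.array([0.5, 2.0])
sigma = 0.3                            # noise std, assumed known and fixed
y = X @ w_true + rng.normal(scale=sigma, size=N)

def mse(w):
    """Mean squared error of the linear model on the training set."""
    return np.mean((y - X @ w) ** 2)

def gaussian_nll(w):
    """Average of -log N(y | f(x; w), sigma^2) over the training set."""
    r = y - X @ w
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + r**2 / (2 * sigma**2))

const = 0.5 * np.log(2 * np.pi * sigma**2)  # independent of w

# Least-squares solution, i.e. the minimizer of the MSE
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
w_other = np.array([0.0, 1.0])              # an arbitrary competing parameter vector

for w in (w_ls, w_other):
    # The two columns agree: NLL = MSE / (2 sigma^2) + const
    print(gaussian_nll(w), mse(w) / (2 * sigma**2) + const)
```

Because the two objectives differ only by a positive scale factor and an additive constant, they rank every \mathbf w identically, so they share the same minimizer.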

References

Marginal Maximum Likelihood - Introduction to MLE for marginal distributions
MLE of Gaussian Parameters - Detailed derivation of MLE for Gaussian parameters
Section 5.5.1 - Conditional Log-Likelihood - Deep Learning Book (Goodfellow, Bengio, Courville)
Section 20.2.4 of Artificial Intelligence: A Modern Approach (Russell & Norvig)
