Visualizing the regression function - the conditional mean
It is now instructive to go over an example to understand that even the plain-old mean squared error (MSE), the standard objective in the regression setting, falls under the same umbrella: it is the cross-entropy between the empirical data distribution and a Gaussian model. Please follow the discussion associated with Section 5.5.1 of Ian Goodfellow's Deep Learning book or Section 20.2.4 of Russell & Norvig's book, and consider the following figure to help visualize the relationship of $p(y \mid x)$ and the conditional mean $\mathbb{E}[y \mid x]$.
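If the figure is not at hand, a minimal sketch of the same picture can be drawn with synthetic data. Everything here is an assumption for illustration: the conditional mean $\mathbb{E}[y \mid x] = \sin(x) + 0.5x$ is a stand-in regression function, and the noise scale 0.3 is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed synthetic data: y is Gaussian noise around an assumed
# conditional mean E[y | x] = sin(x) + 0.5 * x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=300)
cond_mean = np.sin(x) + 0.5 * x                       # E[y | x]
y = cond_mean + rng.normal(scale=0.3, size=x.shape)   # y = E[y|x] + noise

# Scatter the samples and overlay the regression function.
xs = np.linspace(0, 4, 200)
plt.scatter(x, y, s=8, alpha=0.4, label="samples of (x, y)")
plt.plot(xs, np.sin(xs) + 0.5 * xs, "r", lw=2,
         label="regression function E[y | x]")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```

The regression function passes through the center of the noise band at each $x$, which is exactly the conditional-mean picture the section is describing.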
Key Insight: MSE as Cross-Entropy
When we assume the model distribution is Gaussian with fixed variance,

$$p(y \mid x) = \mathcal{N}\!\left(y;\, \hat{y}(x; \theta),\, \sigma^2\right),$$

the negative log-likelihood of $m$ training examples becomes

$$-\sum_{i=1}^{m} \log p\!\left(y^{(i)} \mid x^{(i)}\right) = \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^{2} + \frac{m}{2} \log\!\left(2\pi\sigma^2\right).$$

With $\sigma$ fixed, the second term is a constant, so the negative log-likelihood is proportional to the mean squared error (MSE) plus a constant. Therefore, minimizing MSE is equivalent to maximum likelihood estimation under a Gaussian noise assumption. Key references: (Frazier, 2018; Bengio et al., 2015; Blei et al., 2016; Martin-Maroto & Polavieja, 2018; Beygelzimer et al., 2015)
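To see the equivalence concretely, here is a minimal numerical sketch. The linear model, the data-generating parameters, and the fixed $\sigma = 0.5$ are all assumptions for illustration, not from the text; the point is that the least-squares solution minimizes both objectives, and perturbing the parameters increases both together.

```python
import numpy as np

# Assumed synthetic data from a linear model with Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.5 * x - 0.7 + rng.normal(scale=0.5, size=200)  # assumed w=1.5, b=-0.7

def mse(w, b):
    """Mean squared error of the linear predictor w*x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def gaussian_nll(w, b, sigma=0.5):
    """Average negative log-likelihood under y ~ N(w*x + b, sigma^2)."""
    resid = y - (w * x + b)
    return np.mean(resid ** 2 / (2 * sigma ** 2)
                   + 0.5 * np.log(2 * np.pi * sigma ** 2))

# The closed-form least-squares solution minimizes both objectives.
A = np.stack([x, np.ones_like(x)], axis=1)
w_hat, b_hat = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"w_hat={w_hat:.3f}, b_hat={b_hat:.3f}")
print(f"MSE at optimum: {mse(w_hat, b_hat):.4f}")
print(f"NLL at optimum: {gaussian_nll(w_hat, b_hat):.4f}")
# Moving away from the optimum raises both objectives in lockstep,
# since they differ only by a positive scale and an additive constant.
print(f"MSE perturbed:  {mse(w_hat + 0.1, b_hat):.4f}")
print(f"NLL perturbed:  {gaussian_nll(w_hat + 0.1, b_hat):.4f}")
```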
References
- Bengio, E., Bacon, P., Pineau, J., & Precup, D. (2015). Conditional Computation in Neural Networks for Faster Models.
- Beygelzimer, A., Hazan, E., Kale, S., & Luo, H. (2015). Online Gradient Boosting.
- Blei, D., Kucukelbir, A., & McAuliffe, J. (2016). Variational Inference: A Review for Statisticians.
- Frazier, P. (2018). A Tutorial on Bayesian Optimization.
- Martin-Maroto, F., & Polavieja, G. (2018). Algebraic Machine Learning.

