Let’s reflect on the MSE and how model complexity gives rise to different generalization errors:

$$\mathrm{MSE} = \mathbb{E}[(\hat{y}_i - y_i)^2] = \mathrm{Bias}(\hat{y}_i)^2 + \mathrm{Var}(\hat{y}_i)$$

This means that the MSE captures both the bias and the variance of the estimated targets, and, as shown in the plots above, increasing model capacity can substantially increase the variance of $\hat{y}$. We have seen that as $\mathbf{w}$ tries to exactly fit, or memorize, the data, it minimizes the bias (in fact, for model complexity $M=9$ the bias is zero), but it also exhibits significant variability that is passed on to $\hat{y}$. Although the definition of model capacity is far more rigorous, we will broadly identify complexity with capacity and borrow the figure below from Ian Goodfellow’s Deep Learning book to illustrate the tradeoff between bias and variance. What we did with regularization was to find the $\lambda$ that minimizes the generalization error, i.e., the $\lambda$ that selects the optimal model capacity.
Figure: Generalization error vs. model capacity (adapted from Goodfellow et al., Deep Learning)
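To make the decomposition concrete, here is a minimal Monte Carlo sketch (not from the original text; the target function $\sin(2\pi x)$, the noise level, and the $\lambda$ grid are illustrative assumptions). It refits a ridge-regularized degree-9 polynomial to many noisy resamples of the same target and reports how bias² and variance trade off as $\lambda$ grows.

```python
# Monte Carlo estimate of bias^2 and variance for a degree-9 polynomial
# fit with ridge regularization. Target function, noise level, and the
# lambda grid are illustrative assumptions, not values from the text.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # assumed true target function
x_train = np.linspace(0, 1, 10)            # fixed training inputs
x_test = np.linspace(0, 1, 100)            # evaluation grid
degree, n_datasets, noise_std = 9, 500, 0.3

Phi_tr = np.vander(x_train, degree + 1)    # polynomial design matrices
Phi_te = np.vander(x_test, degree + 1)

for lam in [1e-8, 1e-4, 1e-2, 1.0]:
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        # fresh noisy dataset drawn from the same target
        y = f(x_train) + noise_std * rng.standard_normal(x_train.size)
        # ridge solution: w = (Phi^T Phi + lam I)^{-1} Phi^T y
        w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(degree + 1),
                            Phi_tr.T @ y)
        preds[d] = Phi_te @ w
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    var = preds.var(axis=0).mean()
    print(f"lambda={lam:g}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"bias^2+var={bias2 + var:.4f}")
```

With a tiny $\lambda$ the fit nearly interpolates the 10 training points (near-zero bias, large variance); a large $\lambda$ does the opposite; their sum is smallest at an intermediate $\lambda$, which is exactly the optimal-capacity point described above.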

Bias and Variance Decomposition During the Training Process

Apart from the composition of the generalization error across model capacities, it is interesting to make some general comments on how the generalization error (also known as the risk) decomposes during training. Early in training the bias is large because the predictor’s output is far from the target function, while the variance is small because the data have had little influence yet. Late in training the bias is small because the predictor has learned the underlying function. However, if we train for too long, the predictor also learns the noise specific to the training dataset (overfitting); in that case the variance becomes large, because that noise varies between the training and test datasets. Notably, for deep learning models the risk curve exhibits a pleasant property: as capacity keeps increasing, the generalization error first increases and then enters a second regime where it decreases again, avoiding overfitting, as shown in the figure below (the so-called double-descent behavior).
Figure: The new bias-variance risk curve (double descent)
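The same Monte Carlo setup can track the decomposition over training time. The sketch below (again my own illustration, reusing the assumed $\sin(2\pi x)$ target and degree-9 model; the learning rate and checkpoint schedule are arbitrary choices) runs plain gradient descent on the unregularized squared loss across many resampled datasets and snapshots bias² and variance at a few checkpoints. Bias should shrink as training proceeds while variance grows, which is the early-stopping-as-regularization picture described above.

```python
# Bias^2/variance of a degree-9 polynomial model at several gradient-descent
# checkpoints. Early on, bias dominates; with long training the fit starts to
# memorize the noise and variance grows. All constants are assumptions.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
x_train, x_test = np.linspace(0, 1, 10), np.linspace(0, 1, 100)
Phi_tr, Phi_te = np.vander(x_train, 10), np.vander(x_test, 10)
n_datasets, lr = 300, 0.05
checkpoints = {10, 100, 1_000, 10_000}

# one column per resampled dataset, all trained in parallel
Y = f(x_train)[:, None] + 0.3 * rng.standard_normal((10, n_datasets))
W = np.zeros((10, n_datasets))             # start from the zero predictor
for t in range(1, max(checkpoints) + 1):
    # one gradient step on the mean squared error (n = 10 training points)
    W -= lr * Phi_tr.T @ (Phi_tr @ W - Y) / 10
    if t in checkpoints:
        preds = Phi_te @ W                 # (100 test points, n_datasets)
        bias2 = np.mean((preds.mean(axis=1) - f(x_test)) ** 2)
        var = preds.var(axis=1).mean()
        print(f"step={t:6d}  bias^2={bias2:.4f}  var={var:.4f}")
```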
Key references: (Keskar et al., 2016; Bottou et al., 2016; Goodfellow et al., 2014; Dauphin et al., 2014)
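For the double-descent shape itself, a common toy reproduction (an illustrative sketch, not code from this text or from the references above) is minimum-norm least squares on random ReLU features: sweeping the number of features p past the interpolation threshold p = n typically shows the test error spiking near the threshold and then decreasing again.

```python
# Toy double descent: min-norm least squares on random ReLU features.
# Test error typically spikes near p = n_train (interpolation threshold)
# and then decreases again as p grows. All constants are assumptions.
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
n_train, n_test, noise_std = 40, 500, 0.2
x_tr = rng.uniform(-1, 1, n_train)
x_te = rng.uniform(-1, 1, n_test)
y_tr = f(x_tr) + noise_std * rng.standard_normal(n_train)

def relu_features(x, W, b):
    # random ReLU features: phi_k(x) = max(0, w_k * x + b_k)
    return np.maximum(0.0, np.outer(x, W) + b)

for p in [5, 10, 20, 40, 80, 320, 1280]:
    errs = []
    for _ in range(20):                    # average over random feature draws
        W, b = rng.standard_normal(p), rng.standard_normal(p)
        Phi_tr = relu_features(x_tr, W, b)
        Phi_te = relu_features(x_te, W, b)
        # lstsq returns the minimum-norm solution when p > n_train
        w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
        errs.append(np.mean((Phi_te @ w - f(x_te)) ** 2))
    print(f"p={p:5d}  test MSE={np.mean(errs):.4f}")
```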

References

  • Bottou, L., Curtis, F., Nocedal, J. (2016). Optimization Methods for Large-Scale Machine Learning.
  • Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., et al. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.
  • Goodfellow, I., Vinyals, O., Saxe, A. (2014). Qualitatively characterizing neural network optimization problems.
  • Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M., Tang, P. (2016). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.