Bayesian Optimization

In Bayesian regression or classification, we start with a prior $p_{model}(w)$ over the parameters $w$ of the model $p_{model}(y|x, w)$ and form the initial posterior $p_{model}(w|x,y)$ using Bayes' rule. As data are observed sequentially, the posterior is updated iteratively: after each new data point arrives, the previous posterior acts as the prior and is combined with the likelihood $p_{model}(y|x,w)$ of the new point. We can then form the predictive distribution and make predictions given new values of $x$:

p_{model}(y|x) = \int p_{model}(y|x,w)~p_{model}(w|x,y) ~ dw
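The sequential update and the predictive distribution can be illustrated with a minimal sketch. It assumes the simplest conjugate case, which is an assumption of this example and not something the text prescribes: a scalar model $y = w x + \epsilon$ with Gaussian noise of known variance and a Gaussian prior on $w$, so that every posterior stays Gaussian. The names `GaussianPosterior`, `update`, and `predict` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GaussianPosterior:
    mean: float       # posterior mean of w
    precision: float  # posterior precision (1 / variance) of w


def update(post, x, y, noise_var):
    """One Bayesian update for y = w * x + eps, eps ~ N(0, noise_var).

    The previous posterior plays the role of the prior for the new point.
    """
    precision = post.precision + x * x / noise_var
    mean = (post.precision * post.mean + x * y / noise_var) / precision
    return GaussianPosterior(mean, precision)


def predict(post, x, noise_var):
    """Mean and variance of the predictive distribution p(y | x)."""
    return post.mean * x, x * x / post.precision + noise_var


# Prior p(w) = N(0, 1); made-up observations lying close to y = 2x.
posterior = GaussianPosterior(mean=0.0, precision=1.0)
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
for x, y in data:
    posterior = update(posterior, x, y, noise_var=0.25)

pred_mean, pred_var = predict(posterior, x=4.0, noise_var=0.25)
```

In this conjugate setting the integral over $w$ has a closed form, which is why `predict` needs no numerical integration; for most models the integral has to be approximated.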

Bayesian regression / classification is more complex than plain maximum likelihood (MLE) regression / classification. Whichever of the two methods we use, the learning process always has hyperparameters, and the average loss is expensive to evaluate: it can only be obtained after a training/validation/test run for each combination of hyperparameters. In such a setting, we can apply Bayesian Optimization (BO), which builds a probabilistic model of the underlying loss function. We can make the following analogies:

Instead of the data, $x$ now symbolizes the hyperparameters we try at each iteration.
Instead of the label, $y$ now symbolizes the loss we get at each iteration.
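To make the analogy concrete, the sketch below records a few trials as $(x, y)$ pairs. The function `validation_loss` is a hypothetical, cheap stand-in for a full training/validation run, and the learning-rate values are made up for illustration.

```python
import math


def validation_loss(learning_rate):
    """Hypothetical stand-in for an expensive training/validation run."""
    return (math.log10(learning_rate) + 2.0) ** 2 + 0.1


# Each trial is a pair (x, y): the hyperparameter tried and the loss obtained.
trials = []
for lr in (1e-4, 1e-3, 1.0):
    trials.append((lr, validation_loss(lr)))

# The best loss observed so far and the hyperparameter that produced it.
best_x, best_y = min(trials, key=lambda t: t[1])
```

The list `trials` is the data set on which the regression from hyperparameters to loss is performed.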

So we have a regression problem from $x$ to $y$ like before, but now the semantics are different. There is an underlying target function $p_{data}(y|x)$ that we never get to know, and we will try to approximate it with a model $p_{model}$. We can now apply the sequential Bayesian approach to this problem. According to Bayes,

p_{model}(y|x) = \frac{p_{model}(x|y)p_{model}(y)}{p_{model}(x)}

The prior component $p_{model}(y)$ can be chosen from a family of models known to define a probability distribution over functions (the space of loss functions with range $\mathbb R$). Gaussian Process Regression (GPR) with a radial basis function (RBF) or Matérn kernel is commonly used. In Bayesian optimization, $p_{model}(y)$ is called the surrogate model. If $y^*$ is the best loss observed so far, we can define an acquisition function: an analytical heuristic, cheap to evaluate, that we can optimize in place of the expensive loss function. A common choice is the Expected Improvement (EI), given by

EI(x) = \mathbb E [\max(y^* - y, 0)]

where $y$ is sampled from the posterior at $x$. Optimizing the EI function with respect to $x$ yields new candidate values of $x$ that we can try in a training/validation/test run to see whether they result in a new $y^*$.
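One iteration of this procedure can be sketched as follows, under several stated assumptions: a single scalar hyperparameter in $[0, 1]$, a GP surrogate with an RBF kernel of fixed length scale and unit signal variance, losses centered by their mean, and improvement taken to mean a reduction below the best loss $y^*$. When the posterior at $x$ is Gaussian with mean $\mu$ and standard deviation $\sigma$, EI has the closed form $(y^* - \mu)\Phi(z) + \sigma\phi(z)$ with $z = (y^* - \mu)/\sigma$. All function names and trial values are hypothetical, and the grid search stands in for a proper optimizer of the acquisition function.

```python
import math


def rbf(a, b, length_scale):
    """RBF kernel between two scalar hyperparameter values."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def backward_sub(L, z):
    """Solve L^T a = z for lower-triangular L."""
    n = len(z)
    a = [0.0] * n
    for i in reversed(range(n)):
        a[i] = (z[i] - sum(L[k][i] * a[k] for k in range(i + 1, n))) / L[i][i]
    return a


def gp_posterior(xs, ys, x_new, length_scale=0.3, noise=1e-6):
    """Posterior mean and standard deviation of the GP surrogate at x_new."""
    y_mean = sum(ys) / len(ys)  # center the losses around the zero-mean prior
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, [y - y_mean for y in ys]))
    k_star = [rbf(a, x_new, length_scale) for a in xs]
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = forward_sub(L, k_star)
    var = max(1.0 - sum(t * t for t in v), 0.0)  # rbf(x_new, x_new) == 1
    return mean, math.sqrt(var)


def expected_improvement(mean, std, best_loss):
    """Closed-form EI for minimization when y ~ N(mean, std^2)."""
    if std <= 0.0:
        return max(best_loss - mean, 0.0)
    z = (best_loss - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    # EI is non-negative by definition; clamp to absorb rounding error.
    return max((best_loss - mean) * cdf + std * pdf, 0.0)


# Hypothetical trials: hyperparameter values tried and the losses observed.
xs = [0.1, 0.4, 0.9]
ys = [0.80, 0.30, 0.60]
best_loss = min(ys)

# Score a grid of candidates and propose the one with the highest EI.
grid = [i / 100 for i in range(101)]
scores = [expected_improvement(*gp_posterior(xs, ys, x), best_loss) for x in grid]
x_next = grid[scores.index(max(scores))]
```

In a full loop, `x_next` would be evaluated with a real training/validation run, the resulting pair appended to `xs` and `ys`, and the step repeated. EI is close to zero at points already tried, because the surrogate is nearly certain there, so the proposal lands where the posterior mean is low, the uncertainty is high, or both.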

References

BoTorch Overview
