The Supervised (Inductive) Learning Problem Statement

Let us start with a classic formal definition of the supervised learning problem.

The supervised learning problem: a data generator p_data(x) emits inputs x to an unknown target distribution p_data(y|x), which produces outputs y. The inputs and outputs are sampled into an empirical distribution p-hat_data(x,y) that feeds (x,y) pairs to a learning algorithm whose maximum-likelihood objective equals minimizing the KL divergence to the data. A hypothesis set H supplies the model family p_model, which is split to both the learning algorithm and a loss function L; the algorithm returns the optimal parameters theta-star that define the final hypothesis p_model(y|x,theta-star), used to predict y-hat.

Vapnik’s formulation of the learning problem Editable diagram source: images/learning-problem.c4.yaml The description below is taken from Vladimir Vapnik’s classic book Statistical Learning Theory, albeit with some enhancements to the terminology to make it more in line with our needs. The generator is a source of situations that determines the environment in which the target distribution (Vapnik calls it the supervisor) and the learning algorithm act. Here we consider the simplest environment: the data generator produces vectors

\mathbf{x} \in \mathcal{X}

independently and identically distributed (i.i.d.) according to some unknown (but fixed) distribution

p_{data}(\mathbf{x})

. The vector

\mathbf{x}

is the input to the target, which produces output values

\mathbf{y} \in \mathcal{Y}

. This target is unknown but we know that it exists and does not change. There are two complementary ways to describe it. In the deterministic (function-approximation) view it is an unknown function

f

that maps each

\mathbf{x}

to an output. In the statistical (probabilistic) view it is a conditional distribution

p_{data}(\mathbf{y} \mid \mathbf{x})

from which the output

\mathbf{y}

is drawn for a given

\mathbf{x}

; the deterministic case is then the special case in which this distribution concentrates around a single value (a function corrupted by, at most, noise). The learning algorithm observes data drawn randomly and independently from the joint distribution

p_{data}(\mathbf{x}, \mathbf{y}) = p_{data}(\mathbf{y} \mid \mathbf{x})\, p_{data}(\mathbf{x})

. The empirical (sampling) distribution, denoted

\hat{p}_{data}

, produces examples of inputs

\mathbf{x}

and targets (labels)

\mathbf{y}

\mathcal{D} = \{ (\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_m, \mathbf{y}_m) \}

During what is called training, the learning algorithm constructs an approximation to this unknown target, one in each view. In the deterministic view the approximation is a hypothesis function

g

, drawn from a hypothesis set

\mathcal{H}

and chosen so that

g \approx f

. In the statistical view it is a parametric model family

p_{model}(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})

, and learning tunes the parameters

\boldsymbol{\theta}

so that

p_{model}

approaches

p_{data}(\mathbf{y} \mid \mathbf{x})

; closeness is measured by a divergence, and maximizing the likelihood of the data is equivalent to minimizing

D_{KL}(\hat{p}_{data} \,\|\, p_{model})

, the objective in the learning-algorithm block. The two coincide: the model’s point prediction, such as

\arg\max_{\mathbf{y}} p_{model}(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}^*)

or its conditional mean, plays the role of

g(\mathbf{x})

. It is not yet clear why the conditional mean is the right point prediction, but this will become evident after looking at linear regression and its probabilistic viewpoint. The model is built iteratively, so the final hypothesis, obtained from the best possible parameters

\boldsymbol{\theta}^*

, produces the predicted label

\hat{\mathbf{y}}

for any input

\mathbf{x}_i

. We move between the two descriptions as convenient: the probabilistic one for likelihoods, losses, and uncertainty; the functional one for approximation, capacity, and generalization. The ability to optimally predict, according to a criterion, when observing data that we have never seen before, the test set, is called generalization. Note that in the literature supervised learning is also called inductive learning. Induction is reasoning from observed training cases to general rules (e.g. the final hypothesis function), which are then applied to the test cases. In summary, to learn we need three components:

Data that may be stored (batch) or streamed (online).
An algorithm that optimizes an objective (or loss) function
A hypothesis set

We now have everything in place to start addressing learning tasks such as regression and classification. Key references: (Sohl-Dickstein et al., 2015; Lopez-Paz et al., 2015; Anselmi et al., 2013; Andrychowicz et al., 2016)

References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M., Pfau, D., et al. (2016). Learning to learn by gradient descent by gradient descent.
Anselmi, F., Leibo, J., Rosasco, L., Mutch, J., Tacchetti, A., et al. (2013). Unsupervised Learning of Invariant Representations in Hierarchical Architectures.
Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V. (2015). Unifying distillation and privileged information.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics.

Edit this page on GitHub or file an issue.

​References

References