Skip to main content
Let us start with a classic formal definition of the supervised learning problem. The supervised learning problem: a data generator p_data(x) emits inputs x to an unknown target distribution p_data(y|x), which produces outputs y. The inputs and outputs are sampled into an empirical distribution p-hat_data(x,y) that feeds (x,y) pairs to a learning algorithm whose maximum-likelihood objective equals minimizing the KL divergence to the data. A hypothesis set H supplies the model family p_model, which is split to both the learning algorithm and a loss function L; the algorithm returns the optimal parameters theta-star that define the final hypothesis p_model(y|x,theta-star), used to predict y-hat. Vapnik’s formulation of the learning problem Editable diagram source: images/learning-problem.c4.yaml The description below is taken from Vladimir Vapnik’s classic book Statistical Learning Theory, albeit with some enhancements to the terminology to make it more in line with our needs. The generator is a source of situations that determines the environment in which the target distribution (Vapnik calls it the supervisor) and the learning algorithm act. Here we consider the simplest environment: the data generator produces vectors xX\mathbf{x} \in \mathcal{X} independently and identically distributed (i.i.d.) according to some unknown (but fixed) distribution pdata(x)p_{data}(\mathbf{x}). The vector x\mathbf{x} is the input to the target, which produces output values yY\mathbf{y} \in \mathcal{Y}. This target is unknown but we know that it exists and does not change. There are two complementary ways to describe it. In the deterministic (function-approximation) view it is an unknown function ff that maps each x\mathbf{x} to an output. In the statistical (probabilistic) view it is a conditional distribution pdata(yx)p_{data}(\mathbf{y} \mid \mathbf{x}) from which the output y\mathbf{y} is drawn for a given x\mathbf{x}; the deterministic case is then the special case in which this distribution concentrates around a single value (a function corrupted by, at most, noise). The learning algorithm observes data drawn randomly and independently from the joint distribution pdata(x,y)=pdata(yx)pdata(x)p_{data}(\mathbf{x}, \mathbf{y}) = p_{data}(\mathbf{y} \mid \mathbf{x})\, p_{data}(\mathbf{x}). The empirical (sampling) distribution, denoted p^data\hat{p}_{data}, produces examples of inputs x\mathbf{x} and targets (labels) y\mathbf{y}. D={(x1,y1),,(xm,ym)}\mathcal{D} = \{ (\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_m, \mathbf{y}_m) \} During what is called training, the learning algorithm constructs an approximation to this unknown target, one in each view. In the deterministic view the approximation is a hypothesis function gg, drawn from a hypothesis set H\mathcal{H} and chosen so that gfg \approx f. In the statistical view it is a parametric model family pmodel(yx,θ)p_{model}(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}), and learning tunes the parameters θ\boldsymbol{\theta} so that pmodelp_{model} approaches pdata(yx)p_{data}(\mathbf{y} \mid \mathbf{x}); closeness is measured by a divergence, and maximizing the likelihood of the data is equivalent to minimizing DKL(p^datapmodel)D_{KL}(\hat{p}_{data} \,\|\, p_{model}), the objective in the learning-algorithm block. The two coincide: the model’s point prediction, such as argmaxypmodel(yx,θ)\arg\max_{\mathbf{y}} p_{model}(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}^*) or its conditional mean, plays the role of g(x)g(\mathbf{x}). It is not yet clear why the conditional mean is the right point prediction, but this will become evident after looking at linear regression and its probabilistic viewpoint. The model is built iteratively, so the final hypothesis, obtained from the best possible parameters θ\boldsymbol{\theta}^*, produces the predicted label y^\hat{\mathbf{y}} for any input xi\mathbf{x}_i. We move between the two descriptions as convenient: the probabilistic one for likelihoods, losses, and uncertainty; the functional one for approximation, capacity, and generalization. The ability to optimally predict, according to a criterion, when observing data that we have never seen before, the test set, is called generalization. Note that in the literature supervised learning is also called inductive learning. Induction is reasoning from observed training cases to general rules (e.g. the final hypothesis function), which are then applied to the test cases. In summary, to learn we need three components:
  1. Data that may be stored (batch) or streamed (online).
  2. An algorithm that optimizes an objective (or loss) function
  3. A hypothesis set
We now have everything in place to start addressing learning tasks such as regression and classification. Key references: (Sohl-Dickstein et al., 2015; Lopez-Paz et al., 2015; Anselmi et al., 2013; Andrychowicz et al., 2016)

References

  • Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M., Pfau, D., et al. (2016). Learning to learn by gradient descent by gradient descent.
  • Anselmi, F., Leibo, J., Rosasco, L., Mutch, J., Tacchetti, A., et al. (2013). Unsupervised Learning of Invariant Representations in Hierarchical Architectures.
  • Lopez-Paz, D., Bottou, L., Schölkopf, B., Vapnik, V. (2015). Unifying distillation and privileged information.
  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics.