[Figure: learning-problem diagram (images/learning-problem.c4.yaml)]
The description below is taken from Vladimir Vapnik’s classic book Statistical Learning Theory, with the terminology adapted to bring it in line with our needs.
The generator is a source of situations that determines the environment in which the target distribution (Vapnik calls it the supervisor) and the learning algorithm act. Here we consider the simplest environment: the data generator produces vectors $\mathbf{x}$ independently and identically distributed (i.i.d.) according to some unknown (but fixed) distribution $p(\mathbf{x})$.
The vector $\mathbf{x}$ is the input to the target, which produces output values $y$. The target is unknown, but we know that it exists and does not change. There are two complementary ways to describe it. In the deterministic (function-approximation) view, it is an unknown function $f$ that maps each $\mathbf{x}$ to an output $y = f(\mathbf{x})$. In the statistical (probabilistic) view, it is a conditional distribution $p(y \mid \mathbf{x})$ from which the output is drawn for a given $\mathbf{x}$; the deterministic case is then the special case in which this distribution concentrates around a single value $f(\mathbf{x})$ (a function corrupted by, at most, noise).
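The learning algorithm observes data drawn randomly and independently from the joint distribution $p(\mathbf{x}, y) = p(\mathbf{x})\, p(y \mid \mathbf{x})$. The empirical (sampling) distribution, denoted $\hat{p}_{\text{data}}$, produces examples $(\mathbf{x}_i, y_i)$ of inputs $\mathbf{x}_i$ and targets (labels) $y_i$.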
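The generator, target, and sampling steps described so far can be simulated in a few lines. The sketch below is only an illustration under assumptions that are not part of the text: a one-dimensional input, a uniform generator on [-1, 1], a hypothetical target function `f(x) = sin(pi * x)`, and Gaussian noise standing in for the conditional distribution of the output given the input. All function names are made up for the example.

```python
import numpy as np


def generator(rng, m):
    """Draw m i.i.d. inputs from the assumed input distribution: uniform on [-1, 1]."""
    return rng.uniform(-1.0, 1.0, size=m)


def f(x):
    """Hypothetical deterministic target; in a real problem it is unknown."""
    return np.sin(np.pi * x)


def target(rng, x, noise_std=0.1):
    """Statistical view: draw y from a conditional distribution centred on f(x)."""
    return f(x) + rng.normal(0.0, noise_std, size=np.shape(x))


def sample_dataset(m, seed=0, noise_std=0.1):
    """Draw m (x, y) pairs from the joint distribution: x first, then y given x."""
    rng = np.random.default_rng(seed)
    x = generator(rng, m)
    y = target(rng, x, noise_std)
    return x, y


x_train, y_train = sample_dataset(m=100)
print(x_train.shape, y_train.shape)
```

Setting `noise_std=0.0` recovers the deterministic view, in which every output equals `f(x)` exactly.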
During what is called training, the learning algorithm constructs an approximation to this unknown target, one in each view. In the deterministic view, the approximation is a hypothesis function $g$, drawn from a hypothesis set $\mathcal{H}$ and chosen so that $g \approx f$, where $f$ is the unknown target function. In the statistical view, it is a parametric model family $p_{\text{model}}(y \mid \mathbf{x}; \mathbf{w})$, and learning tunes the parameters $\mathbf{w}$ so that $p_{\text{model}}(y \mid \mathbf{x}; \mathbf{w})$ approaches the target distribution $p(y \mid \mathbf{x})$. Closeness is measured by a divergence: maximizing the likelihood of the data is equivalent to minimizing the Kullback-Leibler divergence $\mathrm{KL}(\hat{p}_{\text{data}} \,\|\, p_{\text{model}})$ between the empirical distribution $\hat{p}_{\text{data}}$ and the model, which is the objective in the learning-algorithm block.
The two views are linked: the model’s point prediction, such as $\arg\max_{y} p_{\text{model}}(y \mid \mathbf{x}; \mathbf{w})$ or its conditional mean $\mathbb{E}[y \mid \mathbf{x}; \mathbf{w}]$, plays the role of $g(\mathbf{x})$. Why the conditional mean is the right point prediction is not yet clear, but it will become evident after looking at linear regression and its probabilistic viewpoint.
The model is built iteratively. The final hypothesis, obtained from the best parameters found, $\mathbf{w}^*$, produces the predicted label $\hat{y}$ for any input $\mathbf{x}$. We move between the two descriptions as convenient: the probabilistic one for likelihoods, losses, and uncertainty; the functional one for approximation, capacity, and generalization.
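The equivalence between maximizing the likelihood and minimizing a divergence can be spelled out in a few steps. The notation is an assumption made for this sketch: $\hat{p}_{\text{data}}$ is the empirical distribution over $m$ training pairs $(\mathbf{x}_i, y_i)$, and $p_{\text{model}}(y \mid \mathbf{x}; \mathbf{w})$ is the model with parameters $\mathbf{w}$.

$$
\begin{aligned}
\mathrm{KL}\left(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\right)
&= \mathbb{E}_{(\mathbf{x}, y) \sim \hat{p}_{\text{data}}}
\left[ \log \hat{p}_{\text{data}}(y \mid \mathbf{x}) - \log p_{\text{model}}(y \mid \mathbf{x}; \mathbf{w}) \right] \\
\arg\min_{\mathbf{w}} \mathrm{KL}\left(\hat{p}_{\text{data}} \,\|\, p_{\text{model}}\right)
&= \arg\min_{\mathbf{w}} \; -\mathbb{E}_{(\mathbf{x}, y) \sim \hat{p}_{\text{data}}}
\left[ \log p_{\text{model}}(y \mid \mathbf{x}; \mathbf{w}) \right] \\
&= \arg\max_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^{m} \log p_{\text{model}}(y_i \mid \mathbf{x}_i; \mathbf{w})
\end{aligned}
$$

The first term inside the expectation does not depend on $\mathbf{w}$, so it drops out of the minimization. What remains is the negative average log-likelihood of the training data, and minimizing it is the same as maximizing the likelihood.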
The ability to predict optimally, according to some criterion, on data that we have never seen before (the test set) is called generalization. In the literature, supervised learning is also called inductive learning: induction is reasoning from observed training cases to general rules (e.g. the final hypothesis function), which are then applied to the test cases.
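Generalization can be made concrete by fitting a hypothesis on one sample and scoring it on a fresh one. The following is a minimal sketch under assumptions that are not in the text: synthetic one-dimensional data with a hypothetical target `sin(pi * x)` plus Gaussian noise, a hypothesis set of polynomials of a fixed degree, and mean squared error as the criterion. Least squares is used for the fit; under a Gaussian noise model it coincides with maximum likelihood. All function names are illustrative.

```python
import numpy as np


def sample(rng, m, noise_std=0.1):
    """Draw m i.i.d. pairs: x uniform on [-1, 1], y = sin(pi * x) + Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=m)
    y = np.sin(np.pi * x) + rng.normal(0.0, noise_std, size=m)
    return x, y


def fit(x, y, degree):
    """Pick the polynomial of the given degree that minimises squared error on (x, y)."""
    features = np.vander(x, degree + 1)  # columns: x^degree, ..., x, 1
    w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return w


def predict(w, x):
    """Final hypothesis: evaluate the fitted polynomial at x."""
    return np.polyval(w, x)


def mse(w, x, y):
    """Mean squared error, the criterion assumed in this sketch."""
    return float(np.mean((predict(w, x) - y) ** 2))


rng = np.random.default_rng(0)
x_train, y_train = sample(rng, 30)
x_test, y_test = sample(rng, 1000)  # unseen data from the same distribution

for degree in (1, 3, 9):
    w = fit(x_train, y_train, degree)
    print(degree, mse(w, x_train, y_train), mse(w, x_test, y_test))
```

Because the polynomial hypothesis sets are nested, the training error cannot increase with the degree. The test error carries no such guarantee, and comparing the two columns is one simple way to judge how well each hypothesis generalizes.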
In summary, to learn we need three components:
- Data that may be stored (batch) or streamed (online).
- An algorithm that optimizes an objective (or loss) function.
- A hypothesis set.

