
Architecture of a single neuron
The perceptron algorithm was invented in the late 1950s by Frank Rosenblatt at the Cornell Aeronautical Laboratory. Neural networks are constructed from neurons, and each neuron is a perceptron with a specific activation function. A single neuron is itself capable of learning; indeed, various standard statistical methods can be viewed in terms of single neurons, so this model will serve as a first and simple example of a supervised neural network.


Perceptron Learning Algorithm
The algorithm is derived by applying SGD to a suitably chosen loss function. The loss function is easy to design if we think of the class labels as belonging to the set $\{+1, -1\}$ (rather than the more usual $\{0, 1\}$) and consider the value of the products $y_i \mathbf{w}^T \mathbf{x}_i$. If there are no classification errors for the chosen non-linear activation function above, these products are positive numbers irrespective of the class. For these cases we assign zero to the loss function. If there are errors, however, these products are negative, and we want to maximize the sum of all these negative product terms, or equivalently minimize its negative:

$$L(\mathbf{w}) = -\sum_{i \in \mathcal{M}} y_i \mathbf{w}^T \mathbf{x}_i$$

where $\mathcal{M}$ denotes the set of misclassified examples. We find the $\mathbf{w}$ that minimizes this loss using the familiar Stochastic Gradient Descent algorithm. Noting that the gradient of the loss function at $\mathbf{w}$, for a single misclassified example, is $-y_i \mathbf{x}_i$, we can write the SGD algorithm as follows. Let $k$ denote the iteration index and $\eta$ the learning rate.

- Initialize the weights and the threshold. Weights may be initialized to zero or to a small random value.
- For each example $i$ in our training set $D$, perform the following over the input $\mathbf{x}_i$ and desired output $y_i$:
  - Update the weights: if the example is misclassified, set $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \eta \, y_i \mathbf{x}_i$; otherwise leave the weights unchanged.
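
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it assumes labels in {+1, -1}, folds the threshold into the weight vector by appending a constant feature, and treats a product of exactly zero as an error so that learning can start from zero weights. The names `train_perceptron`, `predict`, `eta` and `max_epochs`, as well as the toy AND dataset, are illustrative choices and do not come from the text.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning: SGD on the loss -sum of y_i * w.x_i over errors.

    X: (s, d) array of inputs; y: (s,) array of labels in {+1, -1}.
    Returns the learned weights (bias last) and the number of epochs used.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: the threshold is handled as a bias weight on a constant feature.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])                  # initialize the weights to zero
    for epoch in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(Xb, y):
            if y_i * np.dot(w, x_i) <= 0:      # misclassified (or on the boundary)
                w += eta * y_i * x_i           # step against the gradient -y_i * x_i
                errors += 1
        if errors == 0:                        # a full pass with no mistakes
            return w, epoch + 1
    return w, max_epochs

def predict(w, X):
    """Classify rows of X as +1 or -1 using the sign of w.x (bias included)."""
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.where(Xb @ w > 0, 1, -1)

# Toy linearly separable data: logical AND with labels in {+1, -1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])

w, epochs = train_perceptron(X, y)
print("weights:", w, "epochs:", epochs)
print("predictions:", predict(w, X))
```

Because the AND data are linearly separable, the loop stops as soon as one full pass over the data produces no updates. On data that are not separable, it would instead run until `max_epochs`.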




NOTE: For offline learning, the second step may be repeated until the iteration error, averaged over the $s$ examples in the training set, is less than a user-specified error threshold, or until a predetermined number of iterations have been completed.

The perceptron is a linear classifier, so it will never reach a state in which all input vectors are classified correctly if the training set $D$ is not linearly separable, i.e. if the positive examples cannot be separated from the negative examples by a hyperplane. In this case, no "approximate" solution is gradually approached under the standard learning algorithm; instead, learning fails completely. Even for linearly separable datasets, the algorithm may exhibit significant variance while it executes, because an update that considers a currently misclassified example may push previously correctly classified examples into the wrong decision region. Further, the perceptron solution depends on the initial choice of the parameters as well as on the order in which the training examples are presented. Support Vector Machines avoid such pitfalls, which raises the question of why we insisted on learning the perceptron algorithm. The answer is that, both architecturally and functionally, the linear combination of features followed by a non-linearity is the fundamental building block of far more complicated neural networks.
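
The failure on non-separable data can be seen in a small experiment. The sketch below is illustrative only: it assumes labels in {+1, -1}, a bias folded in as a constant feature, and zero initial weights. The helper `run_perceptron` and the AND/XOR toy datasets are hypothetical choices made for this example. It records the number of mistakes in each pass over the data.

```python
import numpy as np

def run_perceptron(X, y, eta=1.0, max_epochs=50):
    """Run perceptron updates and record the number of mistakes per epoch."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # constant feature acts as the bias
    w = np.zeros(Xb.shape[1])
    history = []
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(Xb, y):
            if y_i * (w @ x_i) <= 0:           # misclassified (or on the boundary)
                w = w + eta * y_i * x_i
                mistakes += 1
        history.append(mistakes)
        if mistakes == 0:                      # every example classified correctly
            break
    return w, history

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([-1.0, -1.0, -1.0, 1.0])      # linearly separable
y_xor = np.array([-1.0, 1.0, 1.0, -1.0])       # not linearly separable

_, hist_and = run_perceptron(X, y_and)
_, hist_xor = run_perceptron(X, y_xor)

print("AND mistakes per epoch:", hist_and)
print("XOR mistakes in the last 5 epochs:", hist_xor[-5:])
```

For AND, the mistake count reaches zero and the loop stops. For XOR, no hyperplane separates the classes, so every epoch contains at least one mistake and the run ends only because `max_epochs` is exhausted.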

