Perceptron

A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.
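The perceptron is one such classifier: it thresholds a weighted sum of its inputs. A minimal sketch of that decision rule (the weights, bias, and the AND example below are illustrative, not from the original text):

```python
# Minimal perceptron decision function: output 1 if w.x + b > 0, else 0.
# The weights below are hand-chosen for illustration.

def predict(weights, bias, x):
    """Return 1 if the weighted sum exceeds the threshold, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

# Example: a perceptron computing logical AND of two binary inputs.
w, b = [1.0, 1.0], -1.5
print([predict(w, b, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 0, 0, 1]
```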

The artificial neuron was introduced in 1943 by Warren McCulloch and Walter Pitts in their paper "A logical calculus of the ideas immanent in nervous activity".

[6][7] Later, Rosenblatt obtained funding from the Information Systems Branch of the United States Office of Naval Research and the Rome Air Development Center to build a custom-made computer, the Mark I Perceptron.

[8] The machine was "part of a previously secret four-year NPIC [the US' National Photographic Interpretation Center] effort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters".

[10] Rosenblatt organized the perceptron into three kinds of cells ("units"): AI, AII, and R, which stand for "projection", "association", and "response".

He presented it at the first international symposium on AI, Mechanisation of Thought Processes, held in November 1958.

[16] In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

[22] The machine was shipped from Cornell to the Smithsonian in 1967, under a government transfer administered by the Office of Naval Research.

In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for this class of network to learn the XOR function.

It is often incorrectly believed that they also conjectured that a similar result would hold for a multi-layer perceptron network.

However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function.
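Their point is easy to verify directly: with a single hidden layer of threshold units, fixed weights suffice to compute XOR. A sketch (the weights are a standard hand-chosen construction, not from the text):

```python
# XOR via a two-layer threshold network: XOR(a, b) = AND(OR(a, b), NAND(a, b)).
# All weights are hand-chosen for illustration.

def step(s):
    return 1 if s > 0 else 0

def xor(a, b):
    h1 = step(a + b - 0.5)        # hidden unit computing OR
    h2 = step(-a - b + 1.5)       # hidden unit computing NAND
    return step(h1 + h2 - 1.5)    # output unit: AND of the two hidden units

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 1, 1, 0]
```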

Nevertheless, the often-miscited Minsky and Papert text caused a significant decline in interest and funding of neural network research.

By the time such hardware was completed, simulation on digital computers had become faster than purpose-built perceptron machines.

The kernel perceptron algorithm had already been introduced in 1964 by Aizerman et al.[27] Margin bounds for the perceptron algorithm in the general non-separable case were first given by Freund and Schapire (1998),[1] and more recently by Mohri and Rostamizadeh (2013), who extended the earlier results and gave new, more favorable L1 bounds.
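The kernel perceptron keeps the same mistake-driven update but replaces dot products with a kernel, storing one coefficient per training point. A sketch under illustrative assumptions (the Gaussian kernel choice and the XOR-style data set are mine, not from the text):

```python
import math

def kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel -- one common choice; any positive-definite kernel works."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def train_kernel_perceptron(X, y, epochs=10):
    """Aizerman-style kernel perceptron: alpha[i] counts mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            s = sum(a * yj * kernel(xj, xi) for a, yj, xj in zip(alpha, y, X))
            if yi * s <= 0:          # mistake: strengthen this example's vote
                alpha[i] += 1
    return alpha

def predict(alpha, X, y, x):
    s = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X))
    return 1 if s > 0 else -1

# XOR-style data, which no linear perceptron can separate:
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(X, y)
print([predict(alpha, X, y, x) for x in X])
# → [-1, 1, 1, -1]
```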

In words, one perceptron unit can almost certainly memorize a random assignment of binary labels on N points when N is at most twice the number of its input weights; the perceptron's information capacity is roughly two bits per weight.

A perceptron network with one hidden layer can learn to classify any compact subset arbitrarily closely.

Such a network has input units, one hidden layer, and one output, similar to the Mark I Perceptron machine.

Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions.
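That learning algorithm amounts to: predict, and on a mistake add (or subtract) the misclassified input to the weight vector. A sketch on a linearly separable toy set (the data and stopping rule are illustrative, not from the text):

```python
def train_perceptron(X, y, epochs=100):
    """Rosenblatt's learning rule: on a mistake on (x, y), move w toward y * x."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):              # labels are +1 / -1
            s = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * s <= 0:                   # mistake (or exactly on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:                     # converged: every point classified
            break
    return w, b

# Linearly separable toy data: class +1 roughly above the line x1 + x2 = 3.5.
X = [(1, 1), (2, 0), (0, 1), (3, 3), (2, 4), (4, 1)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_perceptron(X, y)
print(all(yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0
          for xi, yi in zip(X, y)))
# → True (the data are separable, so the algorithm converges)
```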

[35] If the training set is linearly separable, then the perceptron is guaranteed to converge after making finitely many mistakes.
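The classical argument behind this guarantee (Novikoff's mistake bound) can be sketched as follows, assuming every example satisfies ‖x‖ ≤ R and some unit vector w* separates the data with margin γ > 0:

```latex
% Each mistake on (x, y) updates  w <- w + y x,  starting from w_0 = 0.
% The w*-component grows linearly in the number of mistakes M:
w^* \cdot w_M = w^* \cdot w_{M-1} + y\,(w^* \cdot x) \ge w^* \cdot w_{M-1} + \gamma
\quad\Longrightarrow\quad w^* \cdot w_M \ge M\gamma ,
% while the norm grows at most like sqrt(M), since 2y(w_{M-1} . x) <= 0 on a mistake:
\|w_M\|^2 = \|w_{M-1}\|^2 + 2y\,(w_{M-1} \cdot x) + \|x\|^2 \le \|w_{M-1}\|^2 + R^2
\quad\Longrightarrow\quad \|w_M\|^2 \le M R^2 .
% Combining the two via Cauchy-Schwarz:
M\gamma \le w^* \cdot w_M \le \|w_M\| \le \sqrt{M}\,R
\quad\Longrightarrow\quad M \le R^2 / \gamma^2 .
```

So the number of mistakes, and hence of updates, is finite whenever a separating margin exists.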

[37] The perceptron of optimal stability, nowadays better known as the linear support-vector machine, was designed to solve this problem (Krauth and Mezard, 1987).

It can also be used for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications.
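One simple approach to that aim, not described in the text, is the "pocket" variant (Gallant, 1990): run the ordinary perceptron updates, but keep in reserve the weight vector that has misclassified the fewest training points so far. A sketch (the data set, with one deliberately mislabeled point, is illustrative):

```python
def pocket_perceptron(X, y, epochs=50):
    """Ordinary perceptron updates, but remember the best weights seen so far."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0

    def errors(w, b):
        return sum(1 for xi, yi in zip(X, y)
                   if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0)

    best_w, best_b, best_err = w[:], b, errors(w, b)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            s = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * s <= 0:                   # mistake: usual perceptron update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                e = errors(w, b)
                if e < best_err:              # pocket the improvement
                    best_w, best_b, best_err = w[:], b, e
    return best_w, best_b, best_err

# Non-separable data: one mislabeled point among otherwise separable classes.
X = [(1, 1), (2, 0), (0, 1), (3, 3), (2, 4), (4, 1), (0, 0)]
y = [-1, -1, -1, 1, 1, 1, 1]   # labeling (0, 0) as +1 makes the set non-separable
w, b, err = pocket_perceptron(X, y)
print(err)   # a small error count; the mislabeled point forces at least one mistake
```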

The Maxover algorithm (Wendemuth, 1995) is "robust" in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set.

[42] In the linearly separable case, it will solve the training problem – if desired, even with optimal stability (maximum margin between the classes).

For non-separable data sets, it will return a solution with a computable small number of misclassifications.

[43] In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps.

The perceptron of optimal stability, together with the kernel trick, are the conceptual foundations of the support-vector machine.

Another way to solve nonlinear problems without using multiple layers is to use higher order networks (sigma-pi unit).
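A higher-order unit adds product terms of the inputs as extra features, after which a single linear threshold can handle problems like XOR. A sketch (the choice of product feature and weights is illustrative):

```python
def higher_order_features(x):
    """Augment (x1, x2) with the second-order product term x1*x2 (sigma-pi style)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def predict(w, b, x):
    f = higher_order_features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# With the product feature, XOR becomes linearly separable:
# x1 + x2 - 2*x1*x2 equals 1 exactly when the two binary inputs differ.
w, b = (1.0, 1.0, -2.0), -0.5
print([predict(w, b, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 1, 1, 0]
```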

Indeed, under the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and a nonlinear solution would be overfitted.

Like most other techniques for training linear classifiers, the perceptron generalizes naturally to multiclass classification.
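In the multiclass generalization, each class keeps its own weight vector; prediction takes the argmax of the class scores, and on a mistake the true class's weights are raised while the predicted class's are lowered. A sketch (the cluster data and bias handling are illustrative):

```python
def train_multiclass(X, y, n_classes, epochs=50):
    """One weight vector per class; argmax prediction, paired update on mistakes."""
    n = len(X[0])
    W = [[0.0] * (n + 1) for _ in range(n_classes)]   # last slot holds the bias

    def score(c, x):
        return sum(w * f for w, f in zip(W[c], list(x) + [1.0]))

    def predict(x):
        return max(range(n_classes), key=lambda c: score(c, x))

    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = predict(xi)
            if p != yi:                       # mistake: push the scores apart
                f = list(xi) + [1.0]
                W[yi] = [w + v for w, v in zip(W[yi], f)]
                W[p] = [w - v for w, v in zip(W[p], f)]
    return predict

# Three well-separated clusters:
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (10, 0), (11, 0), (10, 1)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predict = train_multiclass(X, y, n_classes=3)
print([predict(x) for x in X])
# → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```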

Since 2002, perceptron training has become popular in the field of natural language processing for such tasks as part-of-speech tagging and syntactic parsing (Collins, 2002).

Mark I Perceptron machine, the first implementation of the perceptron algorithm. It was connected to a camera with 20×20 cadmium sulfide photocells to make a 400-pixel image. The main visible feature is the sensory-to-association plugboard, which sets different combinations of input features. To the right are arrays of potentiometers that implemented the adaptive weights.[2]: 213
The Mark I Perceptron, being adjusted by Charles Wightman (Mark I Perceptron project engineer).[3] Sensory units at left, association units in center, and control panel and response units at far right. The sensory-to-association plugboard is behind the closed panel to the right of the operator. The letter "C" on the front panel is a display of the current state of the sensory input.[4]
Organization of a biological brain and a perceptron.
Components of the Mark I Perceptron, from the operator's manual.[16]
Isometric view of Tobermory Phase I.[25]
The appropriate weights are applied to the inputs, and the resulting weighted sum is passed to a function that produces the output o.
A diagram showing a perceptron updating its linear boundary as more training examples are added
Illustration of the perceptron convergence. In the picture, w* is a unit vector that separates the data with margin γ > 0. All data points have w*·x ≥ γ, since the negative samples are equivalent to positive ones after reflection through the origin. As the learning proceeds, the weight vector performs a somewhat random walk in the space of weights. Each step is at least 90 degrees away from its current direction, thus increasing its norm-square by at most R², the largest squared norm of a sample. Each step adds to the weight vector a point in the samples, and since all the samples have w*·x ≥ γ, the component of the weight vector along w* grows by at least γ per step. Since the norm grows like √N while the w*-component grows like N over N steps, this eventually forces the weight vector to point almost entirely in the w* direction, and thus achieves convergence.
Two classes of points, and two of the infinitely many linear boundaries that separate them. Even though the boundaries are at nearly right angles to one another, the perceptron algorithm has no way of choosing between them.