Perceptron

A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class.
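The perceptron is one such classifier: it thresholds a weighted sum of its inputs. A minimal sketch of that decision rule (the weights, bias, and the AND example below are illustrative, not from the original text):

```python
# Minimal perceptron decision function: output 1 if w.x + b > 0, else 0.
# The weights below are hand-chosen for illustration.

def predict(weights, bias, x):
    """Return 1 if the weighted sum exceeds the threshold, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

# Example: a perceptron computing logical AND of two binary inputs.
w, b = [1.0, 1.0], -1.5
print([predict(w, b, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 0, 0, 1]
```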

The artificial neuron was introduced in 1943 by Warren McCulloch and Walter Pitts in their paper "A logical calculus of the ideas immanent in nervous activity".

[6][7] Later, Rosenblatt obtained funding from the Information Systems Branch of the United States Office of Naval Research and the Rome Air Development Center to build a custom-made computer, the Mark I Perceptron.

[8] The machine was "part of a previously secret four-year NPIC [the US' National Photographic Interpretation Center] effort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters".

[10] Rosenblatt organized the perceptron into three kinds of cells ("units"): AI, AII, and R, which stand for "projection", "association", and "response".

He presented it at the first international symposium on AI, Mechanisation of Thought Processes, held in November 1958.

[16] In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

[22] The machine was shipped from Cornell to the Smithsonian in 1967, under a government transfer administered by the Office of Naval Research.

In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for this class of network to learn the XOR function.

It is often incorrectly believed that they also conjectured that a similar result would hold for a multi-layer perceptron network.

However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function.
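Their point is easy to verify directly: with a single hidden layer of threshold units, fixed weights suffice to compute XOR. A sketch (the weights are a standard hand-chosen construction, not from the text):

```python
# XOR via a two-layer threshold network: XOR(a, b) = AND(OR(a, b), NAND(a, b)).
# All weights are hand-chosen for illustration.

def step(s):
    return 1 if s > 0 else 0

def xor(a, b):
    h1 = step(a + b - 0.5)        # hidden unit computing OR
    h2 = step(-a - b + 1.5)       # hidden unit computing NAND
    return step(h1 + h2 - 1.5)    # output unit: AND of the two hidden units

print([xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 1, 1, 0]
```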

Nevertheless, the often-miscited Minsky and Papert text caused a significant decline in interest and funding of neural network research.

By the time such hardware was completed, simulation on digital computers had become faster than purpose-built perceptron machines.

The kernel perceptron algorithm had already been introduced in 1964 by Aizerman et al.[27] Margin bounds for the perceptron algorithm in the general non-separable case were first given by Freund and Schapire (1998),[1] and more recently by Mohri and Rostamizadeh (2013), who extended the earlier results and gave new, more favorable L1 bounds.
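The kernel perceptron keeps the same mistake-driven update but replaces dot products with a kernel, storing one coefficient per training point. A sketch under illustrative assumptions (the Gaussian kernel choice and the XOR-style data set are mine, not from the text):

```python
import math

def kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel -- one common choice; any positive-definite kernel works."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def train_kernel_perceptron(X, y, epochs=10):
    """Aizerman-style kernel perceptron: alpha[i] counts mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            s = sum(a * yj * kernel(xj, xi) for a, yj, xj in zip(alpha, y, X))
            if yi * s <= 0:          # mistake: strengthen this example's vote
                alpha[i] += 1
    return alpha

def predict(alpha, X, y, x):
    s = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X))
    return 1 if s > 0 else -1

# XOR-style data, which no linear perceptron can separate:
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(X, y)
print([predict(alpha, X, y, x) for x in X])
# → [-1, 1, 1, -1]
```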

In words, one perceptron unit can almost certainly memorize a random assignment of binary labels on N points when N is at most twice the number of its input weights; the perceptron's information capacity is roughly two bits per weight.

A perceptron network with one hidden layer can learn to classify any compact subset arbitrarily closely.

Such a network has input units, one hidden layer, and one output, similar to the Mark I Perceptron machine.

Nonetheless, the learning algorithm described in the steps below will often work, even for multilayer perceptrons with nonlinear activation functions.
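That learning algorithm amounts to: predict, and on a mistake add (or subtract) the misclassified input to the weight vector. A sketch on a linearly separable toy set (the data and stopping rule are illustrative, not from the text):

```python
def train_perceptron(X, y, epochs=100):
    """Rosenblatt's learning rule: on a mistake on (x, y), move w toward y * x."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):              # labels are +1 / -1
            s = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * s <= 0:                   # mistake (or exactly on the boundary)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:                     # converged: every point classified
            break
    return w, b

# Linearly separable toy data: class +1 roughly above the line x1 + x2 = 3.5.
X = [(1, 1), (2, 0), (0, 1), (3, 3), (2, 4), (4, 1)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_perceptron(X, y)
print(all(yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0
          for xi, yi in zip(X, y)))
# → True (the data are separable, so the algorithm converges)
```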

[35] If the training set is linearly separable, then the perceptron is guaranteed to converge after making finitely many mistakes.
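The classical argument behind this guarantee (Novikoff's mistake bound) can be sketched as follows, assuming every example satisfies ‖x‖ ≤ R and some unit vector w* separates the data with margin γ > 0:

```latex
% Each mistake on (x, y) updates  w <- w + y x,  starting from w_0 = 0.
% The w*-component grows linearly in the number of mistakes M:
w^* \cdot w_M = w^* \cdot w_{M-1} + y\,(w^* \cdot x) \ge w^* \cdot w_{M-1} + \gamma
\quad\Longrightarrow\quad w^* \cdot w_M \ge M\gamma ,
% while the norm grows at most like sqrt(M), since 2y(w_{M-1} . x) <= 0 on a mistake:
\|w_M\|^2 = \|w_{M-1}\|^2 + 2y\,(w_{M-1} \cdot x) + \|x\|^2 \le \|w_{M-1}\|^2 + R^2
\quad\Longrightarrow\quad \|w_M\|^2 \le M R^2 .
% Combining the two via Cauchy-Schwarz:
M\gamma \le w^* \cdot w_M \le \|w_M\| \le \sqrt{M}\,R
\quad\Longrightarrow\quad M \le R^2 / \gamma^2 .
```

So the number of mistakes, and hence of updates, is finite whenever a separating margin exists.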

[37] The perceptron of optimal stability, nowadays better known as the linear support-vector machine, was designed to solve this problem (Krauth and Mezard, 1987).

It can also be used for non-separable data sets, where the aim is to find a perceptron with a small number of misclassifications.
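One simple approach to that aim, not described in the text, is the "pocket" variant (Gallant, 1990): run the ordinary perceptron updates, but keep in reserve the weight vector that has misclassified the fewest training points so far. A sketch (the data set, with one deliberately mislabeled point, is illustrative):

```python
def pocket_perceptron(X, y, epochs=50):
    """Ordinary perceptron updates, but remember the best weights seen so far."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0

    def errors(w, b):
        return sum(1 for xi, yi in zip(X, y)
                   if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0)

    best_w, best_b, best_err = w[:], b, errors(w, b)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            s = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * s <= 0:                   # mistake: usual perceptron update
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                e = errors(w, b)
                if e < best_err:              # pocket the improvement
                    best_w, best_b, best_err = w[:], b, e
    return best_w, best_b, best_err

# Non-separable data: one mislabeled point among otherwise separable classes.
X = [(1, 1), (2, 0), (0, 1), (3, 3), (2, 4), (4, 1), (0, 0)]
y = [-1, -1, -1, 1, 1, 1, 1]   # labeling (0, 0) as +1 makes the set non-separable
w, b, err = pocket_perceptron(X, y)
print(err)   # a small error count; the mislabeled point forces at least one mistake
```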

The Maxover algorithm (Wendemuth, 1995) is "robust" in the sense that it will converge regardless of (prior) knowledge of linear separability of the data set.

[42] In the linearly separable case, it will solve the training problem – if desired, even with optimal stability (maximum margin between the classes).

For non-separable data sets, it will return a solution with a computable small number of misclassifications.

[43] In all cases, the algorithm gradually approaches the solution in the course of learning, without memorizing previous states and without stochastic jumps.

The perceptron of optimal stability, together with the kernel trick, are the conceptual foundations of the support-vector machine.

Another way to solve nonlinear problems without using multiple layers is to use higher order networks (sigma-pi unit).
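A higher-order unit adds product terms of the inputs as extra features, after which a single linear threshold can handle problems like XOR. A sketch (the choice of product feature and weights is illustrative):

```python
def higher_order_features(x):
    """Augment (x1, x2) with the second-order product term x1*x2 (sigma-pi style)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

def predict(w, b, x):
    f = higher_order_features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0

# With the product feature, XOR becomes linearly separable:
# x1 + x2 - 2*x1*x2 equals 1 exactly when the two binary inputs differ.
w, b = (1.0, 1.0, -2.0), -0.5
print([predict(w, b, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [0, 1, 1, 0]
```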

Indeed, under the prior constraint that the data come from equi-variant Gaussian distributions, the linear separation in the input space is optimal, and a nonlinear solution would be overfitted.

Like most other techniques for training linear classifiers, the perceptron generalizes naturally to multiclass classification.
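In the multiclass generalization, each class keeps its own weight vector; prediction takes the argmax of the class scores, and on a mistake the true class's weights are raised while the predicted class's are lowered. A sketch (the cluster data and bias handling are illustrative):

```python
def train_multiclass(X, y, n_classes, epochs=50):
    """One weight vector per class; argmax prediction, paired update on mistakes."""
    n = len(X[0])
    W = [[0.0] * (n + 1) for _ in range(n_classes)]   # last slot holds the bias

    def score(c, x):
        return sum(w * f for w, f in zip(W[c], list(x) + [1.0]))

    def predict(x):
        return max(range(n_classes), key=lambda c: score(c, x))

    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = predict(xi)
            if p != yi:                       # mistake: push the scores apart
                f = list(xi) + [1.0]
                W[yi] = [w + v for w, v in zip(W[yi], f)]
                W[p] = [w - v for w, v in zip(W[p], f)]
    return predict

# Three well-separated clusters:
X = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (10, 0), (11, 0), (10, 1)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
predict = train_multiclass(X, y, n_classes=3)
print([predict(x) for x in X])
# → [0, 0, 0, 1, 1, 1, 2, 2, 2]
```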

Since 2002, perceptron training has become popular in the field of natural language processing for such tasks as part-of-speech tagging and syntactic parsing (Collins, 2002).

Mark I Perceptron machine, the first implementation of the perceptron algorithm. It was connected to a camera with 20×20 cadmium sulfide photocells to make a 400-pixel image. The main visible feature is the sensory-to-association plugboard, which sets different combinations of input features. To the right are arrays of potentiometers that implemented the adaptive weights.[2]: 213
The Mark I Perceptron, being adjusted by Charles Wightman (Mark I Perceptron project engineer).[3] Sensory units at left, association units in center, and control panel and response units at far right. The sensory-to-association plugboard is behind the closed panel to the right of the operator. The letter "C" on the front panel is a display of the current state of the sensory input.[4]
Organization of a biological brain and a perceptron.
Components of the Mark I Perceptron, from the operator's manual.[16]
Isometric view of Tobermory Phase I.[25]
The appropriate weights are applied to the inputs, and the resulting weighted sum is passed to a function that produces the output o.
A diagram showing a perceptron updating its linear boundary as more training examples are added
Illustration of the perceptron convergence. In the picture, w* is a unit vector that separates the data with margin γ > 0. All data points have w*·x ≥ γ, since the negative samples are equivalent to positive ones after reflection through the origin. As the learning proceeds, the weight vector performs a somewhat random walk in the space of weights. Each step is at least 90 degrees away from its current direction, thus increasing its norm-square by at most R², the largest squared norm of a sample. Each step adds to the weight vector a point in the samples, and since all the samples have w*·x ≥ γ, the component of the weight vector along w* grows by at least γ per step. Since the norm grows like √N while the w*-component grows like N over N steps, this eventually forces the weight vector to point almost entirely in the w* direction, and thus achieves convergence.
Two classes of points, and two of the infinitely many linear boundaries that separate them. Even though the boundaries are at nearly right angles to one another, the perceptron algorithm has no way of choosing between them.