Viola–Jones object detection framework

The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones.

The detector consists of a sequence of classifiers; each classifier is a single perceptron with several binary masks (Haar features).

The algorithm is efficient for its time, able to detect faces in 384 by 288 pixel images at 15 frames per second on a conventional 700 MHz Intel Pentium III.

While it has lower accuracy than more modern methods such as convolutional neural networks, its efficiency and compact size (only around 50k parameters, compared to the millions of parameters of a typical CNN such as DeepFace) mean it is still used in cases with limited computational power.

For example, in the original paper,[1] they reported that this face detector could run on the Compaq iPAQ at 2 fps (this device has a low power StrongARM without floating point hardware).

To make the task more manageable, the Viola–Jones algorithm only detects full view (no occlusion), frontal (no head-turning), upright (no rotation), well-lit, full-sized (occupying most of the frame) faces in fixed-resolution images.

The restrictions are not as severe as they appear, as one can normalize the picture to bring it closer to the requirements for Viola–Jones.
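
One such normalization is variance normalization, which reduces the effect of lighting. The sketch below assumes a grayscale NumPy image; `normalize_window` is a hypothetical helper name, and the original paper applies the same idea per sub-window during detection:

```python
import numpy as np

def normalize_window(window: np.ndarray) -> np.ndarray:
    """Variance-normalize a grayscale sub-window so its pixel values have
    zero mean and unit standard deviation, reducing lighting effects.
    (Illustrative sketch, not the paper's exact implementation.)"""
    w = window.astype(np.float64)
    std = w.std()
    if std == 0:                    # flat window: only subtract the mean
        return w - w.mean()
    return (w - w.mean()) / std
```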

The "frontal" requirement is non-negotiable, as there is no simple transformation on the image that can turn a face from a side view to a frontal view.

However, one can train multiple Viola–Jones classifiers, one per view angle, and then at run time execute all these classifiers in parallel to detect faces at different view angles.

The "full-view" requirement is also non-negotiable, and cannot be simply dealt with by training more Viola–Jones classifiers, since there are too many possible ways to occlude a face.

The task is to make a binary decision: whether the image is a photo of a standardized face (frontal, well-lit, etc.) or not.

Viola–Jones is essentially a boosted feature learning algorithm, trained by running a modified AdaBoost algorithm on Haar feature classifiers to find a sequence of classifiers f_1, f_2, ..., f_k. The classifiers are applied to an image window one after another; if any classifier in the sequence outputs "no face", the algorithm immediately returns "no face detected".

Each pattern must also be symmetric to x-reflection and y-reflection (ignoring the color change), so for example, for the horizontal white-black feature, the two rectangles must be of the same width.

The Haar features used in the Viola–Jones algorithm are a subset of the more general Haar basis functions, which have been used previously in the realm of image-based object detection.[4] While crude compared to alternatives such as steerable filters, Haar features are sufficiently complex to match features of typical human faces.

The sum of the pixel values within any rectangular area of the image can be computed using only a constant number of additions and subtractions, regardless of the size of the rectangle, by using the summed-area table (also known as the integral image).
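
A minimal sketch of the summed-area table and the constant-time rectangle sum (function names are illustrative):

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """Summed-area table with a zero row/column prepended,
    so rect_sum needs no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii: np.ndarray, x: int, y: int, w: int, h: int) -> int:
    """Sum of pixels in the w-by-h rectangle with top-left corner (x, y),
    using exactly three additions/subtractions regardless of rectangle size."""
    return int(ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x])
```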

Perform a certain modified AdaBoost training on the set of all Haar feature classifiers of dimension

The modified AdaBoost algorithm would output a sequence of Haar feature classifiers

, the algorithm immediately returns "no face detected".

The speed with which features may be evaluated does not adequately compensate for their number, however.

For example, in a standard 24×24 pixel sub-window, there are a total of M = 162,336[5] possible features, and it would be prohibitively expensive to evaluate them all when testing an image.
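
The count M = 162,336 can be reproduced by brute-force enumeration. The sketch below assumes the five standard rectangle-feature shapes (two two-rectangle, two three-rectangle, and one four-rectangle feature), each scaled in multiples of its base size and slid over every valid position:

```python
def count_haar_features(W: int = 24, H: int = 24) -> int:
    # Base (width, height) of the five feature shapes: two 2-rectangle,
    # two 3-rectangle, and one 4-rectangle feature.
    shapes = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]
    total = 0
    for dw, dh in shapes:
        # Scale each shape in multiples of its base size...
        for w in range(dw, W + 1, dw):
            for h in range(dh, H + 1, dh):
                # ...and slide it over every valid position in the window.
                total += (W - w + 1) * (H - h + 1)
    return total

print(count_haar_features())  # 162336
```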

Thus, the object detection framework employs a variant of the learning algorithm AdaBoost to both select the best features and to train classifiers that use them.

Each weak classifier is a threshold function based on the feature f_j: it outputs 1 ("face") if p_j f_j(x) < p_j θ_j and 0 otherwise, where θ_j is a learned threshold and the parity p_j ∈ {+1, −1} sets the direction of the inequality.
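
As a sketch (hypothetical names; the parity/threshold form follows the original paper):

```python
def weak_classify(feature_value: float, threshold: float, parity: int) -> int:
    """Return 1 ("face") if parity * feature_value < parity * threshold,
    else 0. parity is +1 or -1 and flips the direction of the inequality."""
    return 1 if parity * feature_value < parity * threshold else 0
```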

A simplified version of the learning algorithm is reported here:[6]

Input: a set of N positive and negative training images with their labels.
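
The boosting loop can be sketched as a simplified decision-stump AdaBoost (illustrative only, not the paper's exact variant; all names are hypothetical):

```python
import math

def adaboost_rounds(feature_values, labels, n_rounds):
    """Simplified AdaBoost over threshold stumps.

    feature_values[j][i] is the value of feature j on example i;
    labels[i] is 1 (face) or 0 (non-face). Each round selects the
    stump with the lowest weighted error and re-weights the examples.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    chosen = []
    for _ in range(n_rounds):
        best = None  # (error, feature index, threshold, parity)
        for j, values in enumerate(feature_values):
            for threshold in set(values):
                for parity in (1, -1):
                    err = sum(w for w, v, y in zip(weights, values, labels)
                              if (1 if parity * v < parity * threshold else 0) != y)
                    if best is None or err < best[0]:
                        best = (err, j, threshold, parity)
        err, j, threshold, parity = best
        err = max(err, 1e-10)           # avoid division by zero
        beta = err / (1.0 - err)
        alpha = math.log(1.0 / beta)    # vote weight of this stump
        # Down-weight correctly classified examples, then renormalize.
        for i, (v, y) in enumerate(zip(feature_values[j], labels)):
            pred = 1 if parity * v < parity * threshold else 0
            if pred == y:
                weights[i] *= beta
        s = sum(weights)
        weights = [w / s for w in weights]
        chosen.append((j, threshold, parity, alpha))
    return chosen
```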

A given sub-window is immediately discarded as not a face if it fails in any of the stages.

Because the activation of each classifier depends entirely on the behavior of its predecessor, the false positive rate for an entire K-stage cascade is F = ∏_{i=1}^{K} f_i, where f_i is the false positive rate of the i-th stage. Similarly, the detection rate is D = ∏_{i=1}^{K} d_i, where d_i is the detection rate of the i-th stage. Thus, to match the false positive rates typically achieved by other detectors, each classifier can get away with having surprisingly poor performance.

At the same time, however, each classifier needs to be exceptionally capable if it is to achieve adequate detection rates.
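
Writing f_i and d_i for the per-stage false positive and detection rates, the products can be checked numerically. The stage values 0.3 and 0.99 below are illustrative, similar in spirit to the working points discussed in the original paper:

```python
def cascade_rates(stage_fprs, stage_dets):
    """Overall false positive rate F = prod(f_i) and detection rate
    D = prod(d_i) for a cascade evaluated stage by stage."""
    F = D = 1.0
    for f, d in zip(stage_fprs, stage_dets):
        F *= f
        D *= d
    return F, D

# 10 stages, each with a 30% false positive rate but a 99% detection rate:
F, D = cascade_rates([0.3] * 10, [0.99] * 10)
# F = 0.3**10 ≈ 5.9e-6, D = 0.99**10 ≈ 0.904
```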

Rather than re-running the Viola–Jones detector on every frame of a video, one can use tracking algorithms such as the KLT algorithm to detect salient features within the detection bounding boxes and track their movement between frames.

Not only does this improve tracking speed by removing the need to re-detect objects in each frame, but it improves the robustness as well, as the salient features are more resilient than the Viola–Jones detection framework to rotation and photometric changes.

[Figure: Example rectangle features shown relative to the enclosing detection window]
[Figure: A Haar feature resembling the bridge of the nose, applied to a face]
[Figure: A Haar feature resembling the eye region, which is darker than the upper cheeks, applied to a face]