Naive Bayes classifier

Maximum-likelihood training can be done by evaluating a closed-form expression,[2]: 718  which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set.[2][3]

There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
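As a concrete illustration (a minimal sketch, not code from the article), closed-form maximum-likelihood estimation for a Gaussian naive Bayes model amounts to computing relative class frequencies plus per-class feature means and variances in a single pass over the data, which is where the linear training time comes from:

```python
# Minimal sketch of closed-form MLE for a Gaussian naive Bayes model.
# Class priors are relative frequencies; each feature gets a per-class
# mean and variance.  Array shapes are assumptions for illustration.
import numpy as np

def fit_gaussian_nb(X, y):
    """X: (n_samples, n_features) floats; y: (n_samples,) class labels."""
    classes = np.unique(y)
    priors, means, variances = {}, {}, {}
    for c in classes:
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)           # MLE of p(C_k)
        means[c] = Xc.mean(axis=0)             # MLE of per-feature means
        variances[c] = Xc.var(axis=0) + 1e-9   # MLE of variances (tiny floor for stability)
    return priors, means, variances
```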

In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers.[5]

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

The joint probability model $p(C_k, x_1, \dots, x_n)$ can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$p(C_k, x_1, \dots, x_n) = p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k).$$

Under the naive conditional independence assumption, each factor $p(x_i \mid x_{i+1}, \dots, x_n, C_k)$ simplifies to $p(x_i \mid C_k)$.

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is

$$p(C_k \mid x_1, \dots, x_n) = \frac{1}{Z}\, p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k),$$

where the evidence $Z = p(\mathbf{x}) = \sum_k p(C_k)\, p(\mathbf{x} \mid C_k)$ is a scaling factor that depends only on $x_1, \dots, x_n$, i.e. a constant if the values of the feature variables are known.
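Combined with a decision rule, this probability model yields a classifier; one common choice is the maximum a posteriori (MAP) rule, which picks the most probable class:

$$\hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}}\; p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k).$$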

The assumptions on the distributions of features are called the "event model" of the naive Bayes classifier.[8]

For discrete features like the ones encountered in document classification (including spam filtering), multinomial and Bernoulli distributions are popular.
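As an illustration (assuming scikit-learn is available; the tiny arrays below are hypothetical data, not from the article), these event models correspond to separate off-the-shelf estimators:

```python
# Illustrative only: the common event models map onto separate scikit-learn
# estimators.  X_counts / X_binary and y are hypothetical toy data.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])
X_counts = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 2, 2]])  # term counts
X_binary = (X_counts > 0).astype(int)                              # term occurrences

MultinomialNB().fit(X_counts, y)              # multinomial event model (counts/frequencies)
BernoulliNB().fit(X_binary, y)                # Bernoulli event model (binary occurrence)
GaussianNB().fit(X_counts.astype(float), y)   # Gaussian model for real-valued features
```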

Some literature suggests that discretizing (binning) continuous features is required in order to use naive Bayes, but this is not true, and the discretization may throw away discriminative information.

Handling continuous features with kernel density estimation instead of a parametric (e.g., Gaussian) assumption, a method introduced by John and Langley,[8] can boost the accuracy of the classifier considerably.

Working with probabilities in log space is advantageous, since multiplying a large number of small probabilities can lead to floating-point underflow.
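A minimal sketch of this log-space trick (the probabilities below are hypothetical): log-probabilities are summed rather than multiplied, and the posterior is normalized with a log-sum-exp.

```python
# Sum log-probabilities instead of multiplying probabilities, avoiding
# floating-point underflow.  The prior and likelihood values are made up.
import numpy as np
from scipy.special import logsumexp

log_prior = np.log([0.5, 0.5])                    # log p(C_k)
log_like = np.log([[1e-8, 3e-7, 2e-9],            # log p(x_i | C_0)
                   [4e-7, 1e-9, 5e-8]])           # log p(x_i | C_1)

joint = log_prior + log_like.sum(axis=1)          # log p(C_k) + sum_i log p(x_i | C_k)
log_posterior = joint - logsumexp(joint)          # normalize in log space
print(np.exp(log_posterior))
```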

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero, because the probability estimate is directly proportional to the number of occurrences of a feature's value. This is problematic because a zero estimate wipes out all information from the other features when the probabilities are multiplied, so a small-sample correction (a pseudocount) is commonly incorporated so that no probability estimate is ever exactly zero.
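The usual pseudocount fix is Laplace (add-one) smoothing, or more generally add-alpha smoothing; a minimal sketch with hypothetical counts:

```python
# Laplace (add-alpha) smoothing for the zero-frequency problem.
# The counts are hypothetical.
import numpy as np

counts = np.array([3, 0, 7])   # occurrences of each feature value within one class
alpha = 1.0                    # pseudocount; alpha = 1 is Laplace smoothing

p_unsmoothed = counts / counts.sum()                             # contains an exact zero
p_smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))
print(p_unsmoothed, p_smoothed)
```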

Rennie et al. discuss problems with the multinomial assumption in the context of document classification and possible ways to alleviate those problems, including the use of tf–idf weights instead of raw term frequencies and document length normalization, to produce a naive Bayes classifier that is competitive with support vector machines.

If $x_i$ is a Boolean expressing the occurrence or absence of the $i$-th term from the vocabulary, then the likelihood of a document given a class $C_k$ is

$$p(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} p_{ki}^{x_i}\,(1 - p_{ki})^{1 - x_i},$$

where $p_{ki}$ is the probability of class $C_k$ generating the term $i$.

Given a way to train a naive Bayes classifier from labeled data, it's possible to construct a semi-supervised training algorithm that can learn from a combination of labeled and unlabeled data by running the supervised learning algorithm in a loop:[15] given a collection $D = L \uplus U$ of labeled samples $L$ and unlabeled samples $U$, start by training a naive Bayes classifier on $L$; then, until convergence, predict class probabilities $P(C \mid x)$ for all examples $x$ in $D$ and re-train the model on the probabilities (not the hard labels) predicted in the previous step. Convergence is determined based on improvement to the model likelihood $P(D \mid \theta)$, where $\theta$ denotes the parameters of the naive Bayes model.[15]
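A rough sketch of this loop, assuming scikit-learn's MultinomialNB as the base learner and hypothetical arrays X_labeled, y_labeled, X_unlabeled (a fixed iteration count stands in for the likelihood-based convergence test):

```python
# EM-style semi-supervised loop around a naive Bayes learner.
# All variable names and the fixed iteration count are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def semi_supervised_nb(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    clf = MultinomialNB()
    clf.fit(X_labeled, y_labeled)                # initial supervised fit on L
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: soft class probabilities for the unlabeled pool
        probs = clf.predict_proba(X_unlabeled)   # shape (|U|, n_classes)
        # M-step: re-train on L plus U, weighting each unlabeled example
        # by its predicted class probability (one weighted copy per class).
        X_aug = np.vstack([X_labeled] + [X_unlabeled] * len(classes))
        y_aug = np.concatenate([y_labeled] +
                               [np.full(len(X_unlabeled), c) for c in classes])
        w_aug = np.concatenate([np.ones(len(X_labeled))] +
                               [probs[:, j] for j in range(len(classes))])
        clf = MultinomialNB()
        clf.fit(X_aug, y_aug, sample_weight=w_aug)
    return clf
```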

Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes classifier has several properties that make it surprisingly useful in practice.

In particular, the decoupling of the class-conditional feature distributions means that each distribution can be estimated independently as a one-dimensional distribution; this helps alleviate problems stemming from the curse of dimensionality, such as the need for data sets that scale exponentially with the number of features.

While naive Bayes often fails to produce a good estimate for the correct class probabilities,[16] this may not be a requirement for many applications.

For example, the classifier will make the correct MAP decision as long as the correct class is predicted to be more probable than any other class, regardless of whether that probability estimate is slightly or even grossly inaccurate. In this manner, the overall classifier can be robust enough to ignore serious deficiencies in its underlying naive probability model.[17]

Other reasons for the observed success of the naive Bayes classifier are discussed in the literature cited below.

In the case of discrete inputs (indicator or frequency features for discrete events), naive Bayes classifiers form a generative-discriminative pair with multinomial logistic regression classifiers: each naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood $p(C, \mathbf{x})$, while logistic regression fits the same probability model to optimize the conditional $p(C \mid \mathbf{x})$.

The link between the two can be seen by observing that the decision function for naive Bayes (in the binary case) can be rewritten as "predict class $C_1$ if the odds of $p(C_1 \mid \mathbf{x})$ exceed those of $p(C_2 \mid \mathbf{x})$". Expressing this in log space gives

$$\log \frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \log p(C_1 \mid \mathbf{x}) - \log p(C_2 \mid \mathbf{x}) > 0.$$

The left-hand side of this equation is the log-odds, or logit, the quantity predicted by the linear model that underlies logistic regression.
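Concretely, expanding the log-odds under the factorized naive Bayes model gives

$$\log \frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})} = \log \frac{p(C_1)}{p(C_2)} + \sum_{i=1}^{n} \log \frac{p(x_i \mid C_1)}{p(x_i \mid C_2)},$$

which, for the Bernoulli and multinomial event models, is linear in the features and can therefore be written in the form $b + \mathbf{w}^{\top}\mathbf{x}$ used by logistic regression.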

Discriminative classifiers have lower asymptotic error than generative ones; however, research by Ng and Jordan has shown that in some practical cases naive Bayes can outperform logistic regression because it reaches its asymptotic error faster.[18]

Problem: classify whether a given person is male or female based on measured features (in this example, height, weight, and foot size).

In order to classify the sample, one has to determine which posterior is greater, male or female.
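A minimal sketch of that comparison for a Gaussian naive Bayes model; the per-class priors, means, variances, and the test sample below are hypothetical illustrative values, not the article's data:

```python
# Compare unnormalized posteriors prior * prod_i p(x_i | C_k) for the two classes.
# All numeric values are hypothetical.
import math

def gaussian_pdf(x, mean, var):
    """Gaussian class-conditional density p(x_i | C_k)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# class -> (prior, [(mean, var) per feature: height, weight, foot size])
params = {
    "male":   (0.5, [(5.85, 0.03), (176.0, 120.0), (11.2, 0.8)]),
    "female": (0.5, [(5.42, 0.10), (132.5, 550.0), (7.5, 1.7)]),
}
sample = [6.0, 130.0, 8.0]   # height (ft), weight (lb), foot size (in)

scores = {
    c: prior * math.prod(gaussian_pdf(x, m, v) for x, (m, v) in zip(sample, feats))
    for c, (prior, feats) in params.items()
}
print(max(scores, key=scores.get), scores)   # class with the greater posterior wins
```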

Consider the problem of classifying documents by their content, for example into spam and non-spam e-mails.

(In the case of two mutually exclusive alternatives, such as this example, the conversion of a log-likelihood ratio to a probability takes the form of a sigmoid curve: see logit for details.)
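In symbols, with $a$ denoting the log-odds, the conversion is the logistic (sigmoid) function:

$$p(C_1 \mid \mathbf{x}) = \frac{1}{1 + e^{-a}}, \qquad a = \log \frac{p(C_1 \mid \mathbf{x})}{p(C_2 \mid \mathbf{x})}.$$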

Figure: example of a naive Bayes classifier depicted as a Bayesian network.
Figure: likelihood functions, confusion matrix, and ROC curve for a naive Bayes classifier. When the a priori probabilities are the same for all classes, the decision boundary (green line) lies at the point where the two probability densities intersect, since equal priors make the posteriors proportional to the class-conditional densities $p(x \mid C_k)$.