Naive Bayes spam filtering

Naive Bayes classifiers typically use bag-of-words features to identify email spam, an approach commonly used in text classification.

Although naive Bayesian filters did not become popular until later, multiple programs were released in 1998 to address the growing problem of unwanted email.[2]

Variants of the basic technique have been implemented in a number of research works and commercial software products.

Bayes' theorem is used several times in the context of spam filtering. Suppose the suspected message contains the word "replica".

Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches.

Statistics[7] show that the current probability of any message being spam is at least 80%. Nevertheless, many filters assume equal prior probabilities for spam and ham; the filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email.

This assumption permits simplifying the general formula to:

Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))

where Pr(S|W) is the probability that a message is spam given that it contains the word, Pr(W|S) is the probability that the word appears in spam messages, and Pr(W|H) is the probability that it appears in ham messages. This is functionally equivalent to asking, "what percentage of occurrences of the word 'replica' appear in spam messages?"
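Under the equal-priors assumption, a word's spamicity can be estimated directly from training counts. A minimal sketch in Python (the corpus counts below are hypothetical, not taken from the cited statistics):

```python
def spamicity(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H)),
    assuming equal prior probabilities for spam and ham."""
    pr_w_s = spam_counts.get(word, 0) / n_spam  # fraction of spam messages containing the word
    pr_w_h = ham_counts.get(word, 0) / n_ham    # fraction of ham messages containing the word
    if pr_w_s + pr_w_h == 0:
        return 0.5  # word never seen during training: stay neutral
    return pr_w_s / (pr_w_s + pr_w_h)

# Hypothetical training data: "replica" appeared in 60 of 200 spam
# messages and in 2 of 400 ham messages.
p = spamicity("replica", {"replica": 60}, {"replica": 2}, 200, 400)
```

With these counts the word is a strong spam indicator, since it occurs proportionally far more often in spam than in ham.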

For these approximations to make sense, the set of learned messages needs to be big and representative enough.

Most Bayesian spam filtering algorithms are based on formulas that are strictly valid (from a probabilistic standpoint) only if the words present in the message are independent events.

On this basis, one can derive the following formula from Bayes' theorem:

p = (p1 p2 ... pN) / (p1 p2 ... pN + (1 - p1)(1 - p2) ... (1 - pN))

where p is the probability that the suspect message is spam, and p1, ..., pN are the probabilities that it is spam knowing it contains the first, ..., Nth word respectively. Spam filtering software based on this formula is sometimes referred to as a naive Bayes classifier; "naive" refers to the strong independence assumptions between the features.
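The naive combination of per-word probabilities can be sketched as follows; working with logarithms avoids floating-point underflow when a message contains many words (the function name is illustrative):

```python
import math

def combine(probs):
    """Combine individual word spam probabilities p1..pN into an overall
    spam probability p = prod(p_i) / (prod(p_i) + prod(1 - p_i)),
    computed in the log domain for numerical stability.
    Assumes each p_i is strictly between 0 and 1."""
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    # p = exp(log_spam) / (exp(log_spam) + exp(log_ham))
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Two strongly spammy words and one neutral word.
overall = combine([0.95, 0.9, 0.5])
```

Note that a neutral word (probability 0.5) contributes the same factor to both products and so leaves the result unchanged, which is why this assumption is called "not biased".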

More generally, words that were encountered only a few times during the learning phase are problematic, because it would be an error to blindly trust the information they provide.

Applying Bayes' theorem again, and assuming the classification between spam and ham of the emails containing a given word ("replica") is a random variable with a beta distribution, some programs decide to use a corrected probability:

Pr'(S|W) = (s * Pr(S) + n * Pr(S|W)) / (s + n)

where Pr'(S|W) is the corrected probability for the message to be spam, knowing that it contains the given word; s is the strength given to background information about incoming spam; Pr(S) is the probability of any incoming message being spam; n is the number of occurrences of this word during the learning phase; and Pr(S|W) is the spamicity of this word. (Demonstration:[9]) This corrected probability is used instead of the spamicity in the combining formula.
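The corrected probability is a weighted average of the prior and the observed spamicity, so rarely seen words stay close to neutral. A sketch (the defaults s = 3 and Pr(S) = 0.5 are common conventional choices, not values mandated by the source):

```python
def corrected_spamicity(raw_spamicity, n, s=3.0, prior=0.5):
    """Corrected probability Pr'(S|W) = (s * Pr(S) + n * Pr(S|W)) / (s + n).
    With few observations (small n) the estimate stays near the prior;
    it converges to the raw spamicity as n grows."""
    return (s * prior + n * raw_spamicity) / (s + n)

# A word seen only once, always in spam, is pulled back toward 0.5
# rather than being trusted blindly at 1.0.
p = corrected_spamicity(1.0, n=1)
```

This smoothing prevents a single chance occurrence from dominating the combining formula.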

Some software products take into account the fact that a given word appears several times in the examined message,[10] while others do not.

This method gives more sensitivity to context and eliminates the Bayesian noise better, at the expense of a bigger database.

There are other ways of combining individual probabilities for different words than using the "naive" approach.

As it is trained, a Bayesian spam filter eventually assigns spam probabilities tuned to the specific patterns of the mail each user receives.

The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email.
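Corrective training can be sketched as simple per-user count updates: when the user flags a misclassified message, its words are recorded under the corrected label, which shifts their spamicities on the next evaluation. The data structures below are illustrative, not a specific product's design:

```python
from collections import Counter

class UserModel:
    """Per-user word counts that evolve with corrective training."""

    def __init__(self):
        self.spam_words = Counter()  # word -> number of spam messages containing it
        self.ham_words = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, words, is_spam):
        """Record one message's (unique) words under its corrected label."""
        if is_spam:
            self.n_spam += 1
            self.spam_words.update(set(words))
        else:
            self.n_ham += 1
            self.ham_words.update(set(words))

    def spamicity(self, word):
        """Pr(S|W) under equal priors; 0.5 for never-seen words."""
        pr_w_s = self.spam_words[word] / self.n_spam if self.n_spam else 0.0
        pr_w_h = self.ham_words[word] / self.n_ham if self.n_ham else 0.0
        if pr_w_s + pr_w_h == 0:
            return 0.5
        return pr_w_s / (pr_w_s + pr_w_h)

# The user marks a missed spam message as spam; "replica" becomes spammier
# for this user on the next pass.
m = UserModel()
m.train(["meeting", "agenda"], is_spam=False)
m.train(["replica", "watches"], is_spam=True)
```

Because each user's counters are separate, the same word can end up with very different spamicities for different users.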

As a result, Bayesian spam filtering accuracy after training is often superior to pre-defined rules.

Some spammers try to evade filtering by deliberately misspelling or varying characteristic words. As a general rule, this spamming technique does not work very well, because the derived words end up recognized by the filter just like the normal ones.[12]

Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked.

The spam filter is usually unable to analyze this picture, which would contain sensitive words like "Viagra".

A solution used by Google in its Gmail email system is to perform OCR (optical character recognition) on every mid- to large-size image and analyze the text inside.[13][14]

While Bayesian filtering is used widely to identify spam email, the technique can classify (or "cluster") almost any sort of data.

One example is a general purpose classification program called AutoClass which was originally used to classify stars according to spectral characteristics that were otherwise too subtle to notice.