Stemming

Many search engines treat words with the same stem as synonyms, a kind of query expansion known as conflation.
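
As an illustrative sketch only (no particular engine's implementation), the Python fragment below conflates document terms through a placeholder stem() function, so that a query for fish matches documents containing fishing and fished:

    # A minimal sketch of conflation: query and document terms are matched
    # through a shared stem rather than by exact string equality.
    # The trivial stem() below is a crude placeholder, not a real stemmer.

    def stem(word: str) -> str:
        # Placeholder: strip a handful of common English suffixes.
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    documents = {1: ["fishing", "boats"], 2: ["fished", "yesterday"], 3: ["cat"]}

    # Index documents by stem so that "fishing" and "fished" conflate to "fish".
    index: dict[str, set[int]] = {}
    for doc_id, words in documents.items():
        for word in words:
            index.setdefault(stem(word), set()).add(doc_id)

    print(index.get(stem("fish")))  # {1, 2}: both documents match the query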

The first published stemmer was written by Julie Beth Lovins in 1968.[citation needed] Her paper refers to three earlier major attempts at stemming algorithms: one by Professor John W. Tukey of Princeton University; one developed at Harvard University by Michael Lesk, under the direction of Professor Gerard Salton; and a third developed by James L. Dolby of R and D Consultants, Los Altos, California.

A later stemmer was written by Martin Porter and published in the July 1980 issue of the journal Program.

Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.

Many freely distributed reimplementations of the Porter algorithm contained subtle flaws. To eliminate this source of error, Martin Porter released an official free software (mostly BSD-licensed) implementation[2] of the algorithm around the year 2000.

The replacement technique, in which a suffix is rewritten to an alternate form rather than simply removed (for example, ies becoming y), avoids the need for a separate stage in the process to recode or provide partial matching.

Chris Paice also developed a direct measurement for comparing stemmers, based on counting over-stemming and under-stemming errors.

Some examples of the rules include: if the word ends in 'ed', remove the 'ed'; if the word ends in 'ing', remove the 'ing'; if the word ends in 'ly', remove the 'ly'. Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and in encoding suffix stripping rules.
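
A minimal sketch of such rule-driven suffix stripping follows; the rule list is illustrative only and is not the rule set of Porter's or any other published stemmer. The ies to y rule also demonstrates the replacement technique described earlier:

    # A toy rule-based suffix stripper. Real stemmers use far more rules,
    # plus conditions on the shape of what remains after stripping; this
    # rule list is purely illustrative.

    RULES = [
        ("ies", "y"),   # replacement: ponies -> pony
        ("ing", ""),    # stripping:   hopping -> hopp (crude, as noted)
        ("ed", ""),     # stripping:   wanted -> want
        ("ly", ""),     # stripping:   quickly -> quick
        ("s", ""),      # stripping:   cats -> cat
    ]

    def strip_suffix(word: str) -> str:
        for suffix, replacement in RULES:
            # Require a minimal remaining stem so "is" does not become "i".
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)] + replacement
        return word

    for w in ("ponies", "wanted", "quickly", "cats"):
        print(w, "->", strip_suffix(w))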

Suffix stripping algorithms are sometimes regarded as crude given the poor performance when dealing with exceptional relations (like 'ran' and 'run').

If a candidate output term does not exist in the lexicon, the algorithm may try alternate suffix stripping rules.

For example, given the English term friendlies, the algorithm may identify the ies suffix and apply the appropriate rule, producing the result friendl. Since friendl is unlikely to be found in the lexicon, the rule can be rejected and an alternate one tried, as sketched below.
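
A sketch of that fallback behaviour, using a toy lexicon and an invented pair of candidate rules for the ies suffix:

    # Reject a rule when its output is not a known term, then try the next
    # applicable rule. The tiny LEXICON here is purely illustrative.

    LEXICON = {"friendly", "friend", "pony", "run"}

    # Candidate rules for the "ies" suffix, tried in order.
    IES_RULES = [
        ("ies", "y"),   # friendlies -> friendly
        ("ies", ""),    # fallback: bare stripping
    ]

    def stem_ies(word: str) -> str:
        if not word.endswith("ies"):
            return word
        for suffix, replacement in IES_RULES:
            candidate = word[: -len(suffix)] + replacement
            if candidate in LEXICON:  # accept only outputs that exist
                return candidate
        return word                   # no rule produced a known term

    print(stem_ies("friendlies"))  # friendly, not the truncated friendl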

Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say).
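
A sketch of cyclical application with an illustrative three-rule set; each pass feeds its output back through the rules until none fires:

    # Apply one rule, then feed the result back through the rules until no
    # rule matches. The rules are illustrative, not a published rule set.

    RULES = [("ization", "ize"), ("ize", ""), ("s", "")]

    def apply_one_rule(word: str) -> str:
        for suffix, replacement in RULES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)] + replacement
        return word  # no rule fired

    def stem(word: str) -> str:
        reduced = apply_one_rule(word)
        # Recurse while a rule still changes the word; every step shortens
        # the word, so the recursion terminates.
        return word if reduced == word else stem(reduced)

    print(stem("generalizations"))
    # generalizations -> generalization -> generalize -> general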

Chances are that the brute force approach would be faster, as lookup algorithms have direct access to the solution, while rule-based algorithms must try several options, and combinations of them, and then choose which result seems best.
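
The contrast can be made concrete with a small sketch; both tables are toys, but they show the single hash lookup of the brute force approach against the rule scan of the algorithmic one, and incidentally how only the lookup handles the irregular relation between ran and run:

    # Brute force: one direct table access resolves the word (including
    # irregular forms). Rule-based: scan the rules and hope one applies.

    LOOKUP_TABLE = {"running": "run", "ran": "run", "runs": "run"}

    RULES = [("ning", ""), ("ing", ""), ("s", "")]  # "ning" handles doubling

    def lookup_stem(word: str) -> str:
        return LOOKUP_TABLE.get(word, word)   # single hash lookup

    def rule_stem(word: str) -> str:
        for suffix, replacement in RULES:     # may scan every rule
            if word.endswith(suffix):
                return word[: -len(suffix)] + replacement
        return word

    print(lookup_stem("ran"))  # run (the irregular form is in the table)
    print(rule_stem("ran"))    # ran (no suffix rule covers it)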

Lemmatisation, which first determines a word's part of speech and then applies normalisation rules specific to that category, is highly conditional upon obtaining the correct lexical category (part of speech).
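
A sketch of such category-conditioned normalisation, with invented rule lists and tags; the same surface form is treated differently depending on the tag it receives, so a tagging mistake propagates directly into the lemma:

    # Lemmatisation-style normalisation conditioned on a part-of-speech tag
    # supplied by some upstream tagger. Rules and tags are illustrative.

    NOUN_RULES = [("es", ""), ("s", "")]
    VERB_RULES = [("ing", ""), ("ed", ""), ("s", "")]

    def lemmatise(word: str, pos: str) -> str:
        rules = NOUN_RULES if pos == "NOUN" else VERB_RULES
        for suffix, replacement in rules:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)] + replacement
        return word

    print(lemmatise("meetings", "NOUN"))  # meeting
    print(lemmatise("meeting", "VERB"))   # meet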

Stochastic algorithms involve using probability to identify the root form of a word. Such algorithms are trained on a table of root form to inflected form relations, from which they develop a probabilistic model.

This model is typically expressed in the form of complex linguistic rules, similar in nature to those in suffix stripping or lemmatisation.

Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset. This is again similar to suffix stripping and lemmatisation, except that the decisions involved, such as which rule is most appropriate, whether to stem the word at all or simply return it unchanged, or whether to apply two different rules sequentially, are made on the grounds that the output word will have the highest probability of being correct (which is to say, the smallest probability of being incorrect, which is how correctness is typically measured).
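
A sketch of that decision procedure; the probabilities are invented stand-ins for what a trained model would supply:

    # Every applicable action (apply a rule, or leave the word unchanged)
    # yields a candidate; the candidate the model considers most probably
    # correct wins. MODEL_PROB stands in for a trained probabilistic model.

    RULES = [("ies", "y"), ("ies", ""), ("s", "")]

    MODEL_PROB = {"pony": 0.95, "ponies": 0.10, "ponie": 0.02, "pon": 0.01}

    def stochastic_stem(word: str) -> str:
        candidates = [word]  # "do not stem" is always an option
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                candidates.append(word[: -len(suffix)] + replacement)
        # Choose the output with the highest probability of being correct.
        return max(candidates, key=lambda c: MODEL_PROB.get(c, 0.0))

    print(stochastic_stem("ponies"))  # pony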

A simple example is a suffix tree algorithm which first consults a lookup table using brute force.

If the word is not in the exception list, suffix stripping or lemmatisation is applied and the result is output.
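
A sketch of this hybrid flow, with toy exception and rule tables:

    # Hybrid stemming: consult a small exception table of irregular forms
    # first, then fall back to ordinary suffix stripping. Both tables are
    # illustrative toys.

    EXCEPTIONS = {"ran": "run", "geese": "goose", "mice": "mouse"}

    RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

    def hybrid_stem(word: str) -> str:
        # Stage 1: brute force lookup handles exceptional relations directly.
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        # Stage 2: fall through to rule-based suffix stripping.
        for suffix, replacement in RULES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)] + replacement
        return word

    print(hybrid_stem("ran"))      # run  (from the exception table)
    print(hybrid_stem("jumping"))  # jump (from the suffix rules)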

English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", or "axes" being the plural of "axe" as well as of "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex.[citation needed]

There are two error measurements in stemming algorithms: overstemming and understemming.

Overstemming is an error where two separate inflected words are stemmed to the same root when they should not have been, a false positive (for example, universal and university both reducing to the stem univers).

Understemming is an error where two separate inflected words should be stemmed to the same root but are not, a false negative (for example, alumnus and alumni failing to conflate).
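
In the spirit of Paice's pairwise error counting mentioned earlier, though simplified and not his exact formulation, the following sketch counts wrongly merged and wrongly separated word pairs against a gold standard grouping:

    # Words sharing a gold standard group should share a stem (otherwise
    # understemming); words from different groups should not (otherwise
    # overstemming). Data and stemmer are illustrative.

    from itertools import combinations

    GOLD_GROUPS = {
        "run": {"run", "running", "runs"},
        "universe": {"universe"},
        "university": {"university"},
    }

    def count_errors(stem, groups):
        group_of = {w: g for g, words in groups.items() for w in words}
        over = under = 0
        for a, b in combinations(group_of, 2):
            same_stem = stem(a) == stem(b)
            same_group = group_of[a] == group_of[b]
            if same_stem and not same_group:
                over += 1    # wrongly conflated: a false positive
            elif not same_stem and same_group:
                under += 1   # wrongly kept apart: a false negative
        return over, under

    def crude(word):
        # A deliberately aggressive stemmer: truncate to seven characters.
        return word[:7]

    print(count_errors(crude, GOLD_GROUPS))  # (1, 3)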

Stemming is used as an approximate method for grouping words with a similar basic meaning together.[15]

Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages.