Part-of-speech tagging

A simplified form of part-of-speech tagging is commonly taught to school-age children, through the identification of words as nouns, verbs, adjectives, adverbs, and so on.

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.

Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English.

In Europe, tag sets from the EAGLES Guidelines see wide use and include versions for multiple languages.

Tags are usually designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and to much larger cross-language differences.

At the other extreme, Petrov et al.[3] have proposed a "universal" tag set, with 12 categories (for example, no subtypes of nouns, verbs, punctuation, and so on).
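
For illustration, such a coarse inventory can be obtained by collapsing a finer-grained scheme into the 12 universal categories; the abridged Penn-Treebank-style mapping below is a sketch, not the published table:

```python
# Abridged, illustrative collapse of fine-grained Penn Treebank tags into
# 12 coarse "universal" categories; the full published mapping is larger.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": ".",
}

def to_universal(fine_tag: str) -> str:
    """Collapse a fine-grained tag to its coarse category; X is the catch-all."""
    return PTB_TO_UNIVERSAL.get(fine_tag, "X")

print(to_universal("NNS"))  # NOUN
print(to_universal("FW"))   # X (no dedicated coarse category in this sketch)
```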

The Brown Corpus consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications.

Its tagging was repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 1970s the tagging was nearly perfect (allowing for some cases on which even human speakers might not agree).

This corpus has been used for innumerable studies of word frequency and of part of speech, and it inspired the development of similar "tagged" corpora in many other languages.

This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word.
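
As a rough illustration of that cost, with made-up per-word ambiguity counts, the number of candidate tag sequences a higher-level analyzer must entertain grows multiplicatively with sentence length:

```python
from math import prod

# Hypothetical number of candidate tags for each word of a 10-word sentence.
ambiguity = [2, 3, 1, 2, 2, 3, 1, 2, 2, 3]

# An analyzer that defers part-of-speech disambiguation must, in the worst
# case, consider every combination of candidate tags.
print(prod(ambiguity))  # 864 candidate tag sequences for one short sentence
```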

In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English.
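
In that setting the tags are the hidden states of the model: transition probabilities over adjacent tag pairs and emission probabilities of words given tags are estimated from a tagged corpus by relative frequency. A minimal sketch with invented toy data (the actual work trained on tagged British English text):

```python
from collections import Counter, defaultdict

def normalize(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of tag-transition and word-emission probabilities."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"  # pseudo-tag marking the sentence start
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return ({t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

# Invented toy corpus; a real system would use a large hand-tagged corpus.
corpus = [[("the", "DET"), ("can", "NOUN"), ("rusts", "VERB")],
          [("she", "PRON"), ("can", "VERB"), ("run", "VERB")]]
trans_p, emit_p = estimate_hmm(corpus)
print(trans_p["<s>"])         # {'DET': 0.5, 'PRON': 0.5}
print(emit_p["VERB"]["can"])  # 0.333...: one of three VERB tokens is "can"
```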

Eugene Charniak points out in Statistical techniques for natural language parsing (1997)[4] that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns will approach 90% accuracy because many words are unambiguous, and many others only rarely represent their less-common parts of speech.
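
A minimal sketch of that baseline, assuming a hypothetical list of (word, tag) training pairs and Penn-Treebank-style tag names:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_words):
    """Record the most frequent tag observed for each word in training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_common_tag):
    """Known words get their most frequent training tag; unknown words get NNP (proper noun)."""
    return [(w, most_common_tag.get(w.lower(), "NNP")) for w in words]

# Invented toy training data; a real run would use a tagged corpus.
training = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
            ("the", "DT"), ("run", "NN")]
model = train_baseline(training)
print(tag_baseline(["The", "dog", "runs", "to", "Boston"], model))
# [('The', 'DT'), ('dog', 'NN'), ('runs', 'VBZ'), ('to', 'NNP'), ('Boston', 'NNP')]
```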

In 1987, Steven DeRose[7] and Kenneth W. Church[8] independently developed dynamic programming algorithms to solve the same problem in vastly less time.
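
A minimal sketch of such a dynamic-programming (Viterbi-style) decoder over a toy hidden Markov model; the tags and probabilities below are invented for illustration rather than taken from either system:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the highest-probability tag sequence for `words` under an HMM."""
    # best[i][t] = probability of the best tag path ending in tag t at position i
    best = [{t: start_p[t] * emit_p[t].get(words[0], 1e-6) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, p = max(
                ((s, best[i - 1][s] * trans_p[s][t] * emit_p[t].get(words[i], 1e-6))
                 for s in tags),
                key=lambda pair: pair[1],
            )
            best[i][t], back[i][t] = p, prev
    # Trace the best final tag back to the start of the sentence.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Toy model with made-up probabilities.
tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"time": 0.5, "flies": 0.4}, "VERB": {"time": 0.1, "flies": 0.6}}
print(viterbi(["time", "flies"], tags, start_p, trans_p, emit_p))  # ['NOUN', 'VERB']
```

Because the table grows linearly in sentence length and quadratically in the number of tags, the search takes time proportional to length times the square of the tagset size, rather than enumerating every possible tag sequence.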

DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective.

CLAWS and the methods of DeRose and Church did fail for some of the known cases where semantics is required, but those proved negligibly rare.

Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction.
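
One common flavor, sketched below with invented sentences and a deliberately crude overlap measure, is to group words by the distribution of their immediate neighbours so that the induced clusters play the role of tags; real systems use much larger corpora and stronger methods (e.g. Brown clustering or HMM induction):

```python
from collections import Counter

def context_profiles(sentences):
    """Count each word's left and right neighbours in an untagged corpus."""
    profiles = {}
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for i, word in enumerate(padded[1:-1], start=1):
            profile = profiles.setdefault(word, Counter())
            profile[("L", padded[i - 1])] += 1
            profile[("R", padded[i + 1])] += 1
    return profiles

def overlap(p, q):
    """Crude similarity: how much of the two context profiles coincides."""
    return sum(min(p[c], q[c]) for c in set(p) & set(q))

# Invented untagged sentences standing in for a training corpus.
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"],
          ["a", "dog", "sleeps"], ["a", "cat", "barks"]]
profiles = context_profiles(corpus)
print(overlap(profiles["dog"], profiles["cat"]))    # 4: near-identical contexts
print(overlap(profiles["dog"], profiles["barks"]))  # 0: disjoint contexts
```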

In 2014, a paper reported using the structure regularization method for part-of-speech tagging, achieving 97.36% accuracy on a standard benchmark dataset.