Sliding window based part-of-speech tagging

Sliding window based part-of-speech tagging is used to part-of-speech tag a text.

A high percentage of words in a natural language are words which out of context can be assigned more than one part of speech.

The percentage of these ambiguous words is typically around 30%, although it depends greatly on the language.

Solving this problem is very important in many areas of natural language processing.

For example in machine translation changing the part-of-speech of a word can dramatically change its translation.

Sliding window based part-of-speech taggers are programs which assign a single part-of-speech to a given lexical form of a word, by looking at a fixed sized "window" of words around the word to be disambiguated.

The two main advantages of this approach are: Let be the set of grammatical tags of the application, that is, the set of all possible tags which may be assigned to a word, and let be the vocabulary of the application.

Let be a function for morphological analysis which assigns each

, that can be implemented by a full-form lexicon, or a morphological analyser.

Let be the set of word classes, that in general will be a partition of

belong to the same ambiguity class.

This allows good performance for high frequency ambiguous words, and doesn't require too many parameters for the tagger.

With these definitions it is possible to state problem in the following way: Given a text

(either by using the lexicon or morphological analyser) in order to get an ambiguously tagged text

The job of the tagger is to get a tagged text

: Using Bayes formula, this is converted into: where

is the probability that this tag corresponds to the text

In a Markov model, these probabilities are approximated as products.

The syntactic probabilities are modelled by a first order Markov process: where

Lexical probabilities are independent of context: One form of tagging is to approximate the first probability formula: where

In this way the sliding window algorithm only has to take into account a context of size

For example to tag the ambiguous word "run" in the sentence "He runs from danger", only the tags of the words "He" and "from" are needed to be taken into account.