Linguistic categories

Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection.

Tags are usually designed to include overt morphological distinctions, although this leads to inconsistencies, such as case-marking for pronouns but not for nouns in English, and to much larger cross-language differences.

Work on stochastic methods for tagging Koine Greek (DeRose 1990) has used over 1,000 parts of speech and found that about as many words were ambiguous in that language as in English.

For Western European languages, cross-linguistically applicable annotation schemes for parts-of-speech, morphosyntax and syntax have been developed with the EAGLES Guidelines.

The EAGLES guidelines provide guidance for markup to be used with text corpora, particularly for identifying features relevant in computational linguistics and lexicography.

Numerous companies, research centres, universities and professional bodies across the European Union collaborated to produce the EAGLES Guidelines, which set out recommendations for de facto standards and rules of best practice.[3] The EAGLES Guidelines have also inspired subsequent work for other regions, e.g., Eastern Europe.

Petrov et al.[5][6] have proposed a "universal", but highly reductionist, tag set with 12 categories (for example, no subtypes of nouns, verbs, punctuation, etc.).
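The reduction from a fine-grained tag set to 12 coarse categories can be sketched as a simple lookup table. The example below is a minimal illustration, not the authors' own mapping: the twelve category labels follow the published universal tag set, while the choice of Penn Treebank-style fine-grained tags, the partial table `PTB_TO_UNIVERSAL`, and the helper `to_universal` are assumptions made here for demonstration.

```python
# The 12 coarse categories of the universal tag set.
UNIVERSAL_TAGS = {
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
}

# Illustrative, deliberately incomplete mapping from fine-grained
# Penn Treebank-style tags to the coarse categories (an assumption
# for this sketch, not a full reproduction of any published table).
PTB_TO_UNIVERSAL = {
    # All noun subtypes collapse into NOUN.
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    # All verb forms collapse into VERB.
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "DT": "DET", "IN": "ADP",
    "CD": "NUM", "CC": "CONJ", "RP": "PRT",
    # All punctuation collapses into a single "." category.
    ".": ".", ",": ".", ":": ".",
}


def to_universal(tagged_tokens):
    """Map (word, fine_tag) pairs to (word, coarse_tag) pairs.

    Fine-grained tags missing from the table fall back to "X".
    """
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]


sentence = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"),
            ("loudly", "RB"), (".", ".")]
coarse = to_universal(sentence)
```

After the mapping, the plural noun tag and the past-tense verb tag are no longer distinguishable from other nouns and verbs, which is the loss of detail that makes the tag set "reductionist".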

[10] The Universal Dependencies have inspired similar efforts for the areas of inflectional morphology,[11] frame semantics[12] and coreference.[13]

For phrase structure syntax, a comparable effort does not seem to exist, but the specifications of the Penn Treebank have been applied to (and extended for) a broad range of languages,[14] e.g., Icelandic,[15] Old English,[16] Middle English,[17] Middle Low German,[18] Early Modern High German,[19] Yiddish,[20] Portuguese,[21] Japanese,[22] Arabic[23] and Chinese.

[27] The RELISH project created a mirror of the 2010 edition of GOLD as a Data Category Selection within ISOcat.

As of 2018, GOLD data remains an important terminology hub in the context of the Linguistic Linked Open Data cloud, but because it is no longer actively maintained, its function is increasingly being taken over by OLiA (for linguistic annotation, building on GOLD and ISOcat) and lexinfo.net (for dictionary metadata, building on ISOcat).

[28][29][30] An earlier implementation of this standard, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology (see below).

[37] In addition to annotation schemes, the OLiA Reference Model is also linked with the EAGLES Guidelines,[40] GOLD,[40] ISOcat,[41] CLARIN Concept Registry,[42] Universal Dependencies,[43] lexinfo,[43] etc., thereby enabling interoperability between these vocabularies.