Morphological dictionary

Surface forms of words are those found in natural language text.

(Columns are BASE, DERIVED, RULE.)

At the time of writing (2021), all of these are non-aligned morphological dictionaries (see below).

Their simple format is particularly well suited to the application of machine learning techniques, and UniMorph in particular has been the subject of numerous shared tasks.
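As a rough illustration of why such a flat format is easy to process, the following sketch reads a tab-separated, non-aligned dictionary into a lookup table. The three-column layout (lemma, inflected form, feature bundle) follows the style used by UniMorph; the sample entries and the function name `load_dictionary` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample data in a UniMorph-style layout:
# lemma <TAB> inflected form <TAB> semicolon-separated features.
SAMPLE = (
    "house\thouses\tN;PL\n"
    "house\thouse\tN;SG\n"
    "house\thouses\tV;PRS;3;SG\n"
)


def load_dictionary(text):
    """Map each surface form to the list of (lemma, features) analyses."""
    analyses = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        lemma, form, features = line.split("\t")
        analyses[form].append((lemma, tuple(features.split(";"))))
    return dict(analyses)


dictionary = load_dictionary(SAMPLE)
print(dictionary["houses"])
```

Because the surface form is used as the key, an ambiguous form such as "houses" simply maps to more than one analysis.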

In rule-based morphological parsers, both lexicon and rules are normally formalized as finite state automata and subsequently combined.

They thus require morphological dictionaries with specific processing instructions (which often have a linguistic interpretation but are technically treated like arbitrary string symbols).[3]

Popular FST packages such as SFST[4] (as available from the fst package in Debian and Ubuntu) allow users to define application-specific file formats for morphological lexica that bundle different pieces of morphological information with every individual morpheme.

These are thus aligned morphological dictionaries, but very rich (and also, idiosyncratic) in structure.
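The idea of bundling processing instructions with each morpheme can be sketched without any FST toolkit. In the toy example below, each lexicon entry carries an inflection-class symbol that the program treats as an opaque string, and a separate rule table states which suffixes and tags that class licenses. The class names, the entries and the `generate` function are assumptions made for illustration and do not reflect the file format of SFST or any other package.

```python
# Toy lexicon: stem -> inflection class (an arbitrary string symbol).
LEXICON = {"house": "N_s", "fox": "N_es"}

# Toy rules: inflection class -> list of (suffix, lexical tags).
RULES = {
    "N_s": [("", "<n><sg>"), ("s", "<n><pl>")],
    "N_es": [("", "<n><sg>"), ("es", "<n><pl>")],
}


def generate(lexicon, rules):
    """Combine lexicon and rules into (surface form, lexical form) pairs."""
    pairs = []
    for stem, inflection_class in lexicon.items():
        for suffix, tags in rules[inflection_class]:
            pairs.append((stem + suffix, stem + tags))
    return pairs


for surface, lexical in generate(LEXICON, RULES):
    print(surface, lexical)
```

A real finite-state implementation would compile both tables into automata and compose them; the nested loop here only mimics the result of that combination for a tiny, hand-written lexicon.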

Interlinear Glossed Text (IGT) is a popular formalism in language documentation, linguistic typology and other branches of linguistics and the philologies.

IGT can be created with nothing more than a conventional text editor, but specialized software has also been developed, with notable examples including Toolbox,[6] the FieldWorks Language Explorer (FLEx),[7] and open source alternatives such as Xigt.[8]

Toolbox and FLEx support semi-automated annotation by means of an internal morphological dictionary.

FLEx and Toolbox provide different editor functionalities for annotating text and editing dictionaries, so that information beyond that found in annotations can be added, but at their core, their formats provide aligned morphological dictionaries.

The formats of FLEx and Toolbox are not intended for human consumption, nor are they well supported by any processing software other than their native tools.
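To make the connection between glossed text and aligned dictionaries concrete, here is a minimal sketch that pairs each morph in a segmentation line with its gloss. It assumes the common convention that words are separated by spaces and morphs by hyphens; the sample lines and the function `morph_gloss_pairs` are hypothetical and unrelated to the actual FLEx or Toolbox file formats.

```python
def morph_gloss_pairs(segmentation, glosses):
    """Pair every morph with its gloss, word by word."""
    seg_words = segmentation.split()
    gloss_words = glosses.split()
    if len(seg_words) != len(gloss_words):
        raise ValueError("lines differ in number of words")
    pairs = []
    for seg_word, gloss_word in zip(seg_words, gloss_words):
        morphs = seg_word.split("-")
        tags = gloss_word.split("-")
        if len(morphs) != len(tags):
            raise ValueError(f"misaligned word: {seg_word!r} / {gloss_word!r}")
        pairs.extend(zip(morphs, tags))
    return pairs


print(morph_gloss_pairs("the house-s", "DEF house-PL"))
```

Collecting such pairs over a whole corpus yields a simple morph-to-gloss dictionary of the kind that annotation tools can reuse to suggest glosses for new text.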

OntoLex is a community standard for machine-readable dictionaries on the web.

In 2019, the OntoLex-Morph module was proposed to facilitate data modelling of morphology in lexicography and to provide a data model for morphological dictionaries for natural language processing.[9]

OntoLex-Morph supports both aligned and non-aligned morphological dictionaries.

In an aligned morphological dictionary, the correspondence between the surface form and the lexical form of a word is aligned at the character level. In such alignments, θ is the empty symbol, ⟨n⟩ signifies "noun", and ⟨pl⟩ signifies "plural".

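Formally, if $\Sigma$ is the alphabet of the input symbols and $\Gamma$ is the alphabet of the output symbols, an aligned morphological dictionary is a subset $A \subseteq L^{*}$, where $L = (\Sigma \cup \{\theta\}) \times (\Gamma \cup \{\theta\})$ is the alphabet of all possible alignments, including the empty symbol.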
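One possible way to represent such an alignment in code is as a sequence of (surface symbol, lexical symbol) pairs, with θ standing in for the empty symbol. The sketch below uses "houses" as an illustrative entry; the variable and function names are assumptions, not part of any standard, and the tags are written with ASCII angle brackets.

```python
THETA = "θ"  # the empty symbol

# Illustrative aligned entry for "houses": each pair is (surface, lexical).
HOUSES = [
    ("h", "h"), ("o", "o"), ("u", "u"), ("s", "s"), ("e", "e"),
    ("s", "<n>"), (THETA, "<pl>"),
]


def surface_form(alignment):
    """Concatenate the surface side, dropping empty symbols."""
    return "".join(s for s, _ in alignment if s != THETA)


def lexical_form(alignment):
    """Concatenate the lexical side, dropping empty symbols."""
    return "".join(l for _, l in alignment if l != THETA)


print(surface_form(HOUSES))
print(lexical_form(HOUSES))
```

Projecting either side of the alignment recovers the plain surface string or the plain lexical string, which is exactly the information a non-aligned dictionary stores without the character-level correspondence.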

For example, "house" may be a noun in the singular, /haʊs/, or may be a verb in the present tense, /haʊz/.