Text segmentation

In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.

In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
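For illustration, text written with such an explicit delimiter can be segmented simply by splitting on that character. The following minimal Python sketch assumes the Ethiopic wordspace ፡ (U+1361), the character traditionally used for this purpose in the Ge'ez script; the sample phrase is illustrative:

```python
# Split Ge'ez-script text on the Ethiopic wordspace (U+1361),
# which historically served as an explicit word delimiter.
ETHIOPIC_WORDSPACE = "\u1361"

def split_ethiopic(text: str) -> list[str]:
    """Return the words of a string delimited by the Ethiopic wordspace."""
    return [w for w in text.split(ETHIOPIC_WORDSPACE) if w]

print(split_ethiopic("ሰላም፡ለዓለም"))  # ['ሰላም', 'ለዓለም']
```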

However, even in English this problem is not trivial, owing to the use of the full stop character for abbreviations, which may or may not also terminate a sentence. In "Mr. Smith went to Washington.", for example, the first period marks an abbreviation while the second ends the sentence.

When processing plain text, tables of abbreviations that contain periods can help prevent incorrect assignment of sentence boundaries.
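A minimal Python sketch of this idea, using a small hand-picked abbreviation table (the table and the whitespace tokenization are illustrative simplifications, not a complete solution):

```python
# Illustrative table of period-bearing abbreviations; a production
# system would use a much larger, domain-specific list.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc.", "e.g.", "i.e."}

def split_sentences(text: str) -> list[str]:
    """Split text on sentence-final punctuation, consulting the
    abbreviation table so that e.g. 'Dr.' does not end a sentence."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing material without final punctuation
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```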

As with word segmentation, not all written languages contain punctuation characters that are useful for approximating sentence boundaries.

Many different approaches have been tried,[3][4] including hidden Markov models, lexical chains, passage similarity based on word co-occurrence, clustering, and topic modeling.
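As a concrete illustration of the passage-similarity idea, the following Python sketch, loosely in the spirit of TextTiling, compares the vocabularies of adjacent sentence windows using cosine similarity over word counts and proposes a boundary wherever lexical cohesion drops; the window size and threshold are illustrative parameters, not tuned values:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def topic_boundaries(sentences: list[str], window: int = 2,
                     threshold: float = 0.1) -> list[int]:
    """Return indices i such that a topic boundary is proposed between
    sentences[i-1] and sentences[i], wherever the similarity of the
    surrounding word-count windows falls below the threshold."""
    bags = [Counter(s.lower().split()) for s in sentences]
    boundaries = []
    for i in range(window, len(bags) - window + 1):
        left = sum(bags[i - window:i], Counter())   # words before the gap
        right = sum(bags[i:i + window], Counter())  # words after the gap
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```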

It is quite an ambiguous task: even human judges evaluating text segmentation systems often disagree about where topic boundaries lie.

When punctuation and similar clues are not consistently available, the segmentation task often requires fairly non-trivial techniques, such as statistical decision-making, large dictionaries, and consideration of syntactic and semantic constraints.
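One classic dictionary-based technique for text written without spaces is greedy maximum matching: at each position, take the longest dictionary word that matches and continue from where it ends. A minimal Python sketch, with a toy dictionary chosen purely for illustration:

```python
def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match word segmentation for unspaced text.
    Falls back to a single character when no dictionary word matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidates first
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown character: emit as-is
            i += 1
    return words

# Toy example: English words with the spaces removed.
print(max_match("thetabledownthere", {"the", "table", "down", "there", "her"}))
# ['the', 'table', 'down', 'there']
```

Greedy matching is only a baseline; the statistical techniques mentioned above are typically layered on top to resolve cases where the longest match leads the segmenter astray.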

There are two general approaches: manually analyzing the text and writing custom segmentation rules, or annotating a sample corpus with boundary information and applying machine learning. Some text segmentation systems additionally take advantage of markup such as HTML and known document formats such as PDF to provide extra evidence for sentence and paragraph boundaries.
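As an illustration of the markup-based idea, paragraph boundaries can be read directly from HTML structure rather than inferred from raw text. The following sketch uses Python's standard html.parser module; the handling of nested or malformed markup is deliberately simplified:

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element; the tags themselves
    serve as explicit paragraph-boundary evidence."""
    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

parser = ParagraphExtractor()
parser.feed("<p>First paragraph.</p><p>Second paragraph.</p>")
print(parser.paragraphs)  # ['First paragraph.', 'Second paragraph.']
```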