Language identification

Computational approaches to this problem view it as a special case of text categorization, solved with various statistical methods.
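
As an illustration of the text-categorization view, one possible statistical method is a naive Bayes classifier over character trigrams. The following is a minimal sketch and not a reference implementation: the class name `NgramLanguageClassifier`, the trigram size, the add-one smoothing, and the uniform prior over languages are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageClassifier:
    """Hypothetical naive Bayes classifier over character n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def classify(self, text):
        """Return the trained language with the highest log-likelihood."""
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen n-grams

        def log_likelihood(language):
            counts = self.counts[language]
            total = self.totals[language]
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            return sum(
                math.log((counts[g] + 1) / (total + vocab_size)) for g in grams
            )

        # Assumes a uniform prior, so only the likelihoods are compared.
        return max(self.counts, key=log_likelihood)
```

A classifier like this would be trained by calling `train` once or more per language with sample text and then queried with `classify`. Its accuracy depends heavily on the amount of training text, and very short inputs remain hard to classify reliably.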

A mutual-information-based distance measure is essentially equivalent to more conventional model-based methods and is not generally considered either novel or better than simpler techniques.

Input text composed of several languages, as is common on the Web, is also problematic for any approach.

An older statistical method by Grefenstette was based on the prevalence of certain function words (e.g., "the" in English).
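
The function-word idea can be sketched as follows. This is a simplified illustration and not Grefenstette's actual procedure: the word lists in `FUNCTION_WORDS` are small hand-picked assumptions, the function name `identify_by_function_words` is hypothetical, and scoring is a plain count of matching tokens.

```python
import re

# Tiny illustrative function-word lists (assumed for this sketch,
# not taken from any published data).
FUNCTION_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "that", "it"},
    "french": {"le", "la", "les", "de", "et", "est", "un", "une", "que"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def identify_by_function_words(text):
    """Return the language whose function words occur most often, or None."""
    # Split into lowercase alphabetic tokens (Unicode letters only).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        language: sum(1 for token in tokens if token in words)
        for language, words in FUNCTION_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No function word matched at all: decline to guess.
    return best if scores[best] > 0 else None
```

A counting scheme like this works best on longer passages. Short inputs may contain no function words at all, in which case the sketch returns `None` instead of guessing.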

Closely related languages, such as Bulgarian and Macedonian or Indonesian and Malay, have significant lexical and structural overlap, making it challenging for systems to discriminate between them.