Text normalization

Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it.

When text is converted to speech, numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context.
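As a rough illustration of why context matters, the abbreviation "St." can stand for "Saint" or "Street" depending on its neighbours. The sketch below is only a toy heuristic written in Python; the function name `expand_st` and the two rules it applies are assumptions made for this example, not a description of how any particular speech system works.

```python
import re

def expand_st(text: str) -> str:
    """Expand "St." using a toy positional heuristic (hypothetical helper)."""
    # Rule 1: "St." directly before a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Rule 2: any remaining "St." that follows a word is read as "Street".
    text = re.sub(r"(?<=[A-Za-z])\s+St\.", " Street", text)
    return text

print(expand_st("St. John St."))
```

A heuristic this simple misfires easily: in "Main St. Then we left", the abbreviation is followed by a capitalized word and would be read as "Saint". Cases like this are why context-dependent normalization generally needs more than pattern matching.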

For instance, if a search for "resume" is to match the word "résumé", then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case.

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice.
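A minimal sketch of such context-independent normalization, using Python's standard `re` and `unicodedata` modules, might look like the following. The function name `normalize_for_search` is hypothetical, and the final pattern assumes that only ASCII letters and digits should be kept, which would not suit text in non-Latin scripts.

```python
import re
import unicodedata

def normalize_for_search(text: str) -> str:
    """Strip diacritics, fold case, and drop punctuation (illustrative only)."""
    # Decompose accented characters into base letter + combining mark (NFD).
    decomposed = unicodedata.normalize("NFD", text)
    # Remove combining diacritical marks (U+0300 to U+036F) with a regex.
    stripped = re.sub(r"[\u0300-\u036f]", "", decomposed)
    # Fold case so that "John" and "john" compare equal.
    folded = stripped.casefold()
    # Assumption: keep ASCII letters and digits only; collapse the rest to a space.
    return re.sub(r"[^a-z0-9]+", " ", folded).strip()

print(normalize_for_search("résumé") == normalize_for_search("resume"))
print(normalize_for_search("John") == normalize_for_search("john"))
```

Applying the same function to both the stored text and the query is what makes the comparison consistent; normalizing only one side would leave the mismatch in place.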

In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources.