Charset detection

Badly written charset detection routines, however, do not run the reliable UTF-8 test first, and may decide that UTF-8 text is in some other encoding.
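As a minimal sketch of the "UTF-8 first" ordering, the function below tries a strict UTF-8 decode before considering anything else. The name `guess_encoding` and the `fallback` parameter are illustrative assumptions, not part of any particular detector; a real routine would apply further heuristics instead of a fixed fallback.

```python
def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Return "utf-8" if data is valid UTF-8, otherwise the fallback label.

    Hypothetical helper: the fixed fallback stands in for whatever
    legacy-encoding heuristics a real detector would run next.
    """
    try:
        # Strict decoding rejects malformed sequences, so text in a legacy
        # eight-bit encoding with non-ASCII bytes rarely passes by accident.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Pure ASCII input is also reported as UTF-8 here, which is harmless because ASCII is a subset of UTF-8.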

For example, web sites in UTF-8 containing the name of the German city München were commonly shown as MÃ¼nchen, because the code decided the text was in an ISO-8859 encoding before (or without) even testing whether it was UTF-8.
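The garbling is easy to reproduce. The short sketch below decodes the UTF-8 bytes of the city name as ISO-8859-1; the variable names are illustrative only.

```python
city = "M\u00fcnchen"                       # "München", written with an escape

utf8_bytes = city.encode("utf-8")           # the u-umlaut becomes two bytes: C3 BC
garbled = utf8_bytes.decode("iso-8859-1")   # each byte is read as one character
print(garbled)

# The damage is reversible as long as no bytes were altered along the way.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)
```

Because ISO-8859-1 accepts every byte value, the wrong decode raises no error, which is why the mistake goes unnoticed by the software.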

Checking only that the text is valid UTF-16 is not enough; common characters must be checked for as well. The Windows operating system would mis-detect the phrase "Bush hid the facts" (without a newline) in ASCII as Chinese UTF-16LE, since all the byte pairs matched assigned Unicode characters in UTF-16LE.
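The following sketch shows why a validity test alone is misleading for this phrase, and what a check for common characters might look like. The `looks_plausible` helper and its `COMMON` character set are assumptions made for illustration; they do not describe the heuristic Windows actually used.

```python
data = b"Bush hid the facts"        # 18 ASCII bytes, no trailing newline

# Decoding succeeds: each pair of bytes forms a valid UTF-16LE code unit.
text = data.decode("utf-16-le")

# Hypothetical set of characters that real text nearly always contains:
# ASCII whitespace plus a few common CJK spacing and punctuation marks.
COMMON = set(" \n\r\t\u3000\u3001\u3002\uff0c")

def looks_plausible(s: str) -> bool:
    """Return True if s contains at least one commonly occurring character."""
    return any(ch in COMMON for ch in s)

print(len(text), looks_plausible(text))
print(looks_plausible(data.decode("ascii")))
```

All nine decoded characters fall in the CJK Unified Ideographs range, so the UTF-16LE reading is valid but contains none of the characters in `COMMON`, while the ASCII reading contains several spaces.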

The ISO-8859 encodings are closely related eight-bit encodings whose lower half overlaps with ASCII, and all arrangements of bytes are valid in them, so a validity test cannot tell them apart.
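A small sketch makes the point concrete using Python's built-in codecs. The three encodings chosen here are an assumption for illustration; Python's codecs for some other parts of the family leave a few byte values unassigned.

```python
all_bytes = bytes(range(256))
names = ("iso-8859-1", "iso-8859-2", "iso-8859-15")

# Every byte value decodes without error in each of these encodings.
decoded = {name: all_bytes.decode(name) for name in names}

# The lower halves (bytes 0x00-0x7F) are identical, matching ASCII.
lower_halves = {text[:128] for text in decoded.values()}
print(len(lower_halves))

# The upper halves differ, but only in meaning, never in validity.
sample = b"\xa4"
print(sample.decode("iso-8859-1"), sample.decode("iso-8859-15"))
```

The same byte 0xA4 is the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15, and nothing in the byte itself says which was intended.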

Even though UTF-8 and UTF-16 are easy to detect, some systems require documents in UTF encodings to be explicitly labeled with a prefixed byte order mark (BOM).
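Checking for a BOM can be sketched with the constants in Python's standard `codecs` module. The function name `sniff_bom` is a hypothetical choice, and covering UTF-32 is an addition beyond the encodings named above; it is included only because the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark.

```python
import codecs
from typing import Optional

# Longer marks are listed before shorter ones that share a prefix:
# BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A result of `None` only means the document is unlabeled; it may still be UTF-8 or UTF-16, so a detector would fall through to content-based tests in that case.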