Mojibake

Symptoms of this failed rendering include blocks in which the code point is displayed in hexadecimal, or the generic replacement character ("�").
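The replacement-character symptom can be reproduced with a minimal Python sketch: a byte sequence that is invalid in the expected encoding is rendered as U+FFFD.

```python
# "café" encoded in Latin-1; the lone 0xE9 byte is not valid UTF-8
raw = b"caf\xe9"

# A strict UTF-8 decoder raises UnicodeDecodeError; with errors="replace",
# the offending byte becomes the generic replacement character U+FFFD
print(raw.decode("utf-8", errors="replace"))  # caf�
```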

A major source of trouble is communication protocols that rely on settings on each computer rather than sending or storing metadata together with the data.

Whereas Linux distributions mostly switched to UTF-8 in 2004,[2] Microsoft Windows generally uses UTF-16, and sometimes uses 8-bit code pages for text files in different languages.

Depending on the type of software, the typical solution is either configuration or charset detection heuristics, both of which are prone to mis-prediction.

The encoding of text files is affected by the locale setting, which depends on the user's language and the brand of operating system, among other conditions.

For Unicode, one solution is to use a byte order mark, but many parsers do not tolerate this for source code or other machine-readable text.
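The byte order mark's trade-off can be sketched in Python using the standard `codecs` module: the BOM identifies the encoding, but a plain UTF-8 decode leaves it in the text, where a strict parser may reject it as stray leading content.

```python
import codecs

text = "mojibake"
data = codecs.BOM_UTF8 + text.encode("utf-8")  # file contents with a BOM

print(data[:3])  # b'\xef\xbb\xbf' -- the BOM bytes

# A plain "utf-8" decode keeps the BOM as U+FEFF, which parsers for source
# code or other machine-readable text may treat as an error:
assert data.decode("utf-8") == "\ufeff" + text

# Python's "utf-8-sig" codec recognizes and strips the BOM:
assert data.decode("utf-8-sig") == text
```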

Likewise, many early operating systems do not support multiple encoding formats and thus display mojibake if made to render non-standard text. Early versions of Microsoft Windows and Palm OS, for example, are localized on a per-country basis and support only the encoding standards relevant to the country where the localized version will be sold; they display mojibake when opening a file whose text is in an encoding that the OS version is not designed to support.

The problem gets more complicated when it occurs in an application that normally does not support a wide range of character encoding, such as in a non-Unicode computer game.

Commodore brand 8-bit computers used PETSCII encoding, particularly notable for inverting the upper and lower case compared to standard ASCII.

The additional characters are typically the ones that become corrupted, making texts only mildly unreadable with mojibake: ... and their uppercase counterparts, if applicable.

However, with the advent of UTF-8, mojibake has become more common in certain scenarios, e.g. exchange of text files between UNIX and Windows computers, due to UTF-8's incompatibility with Latin-1 and Windows-1252.
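This failure mode is easy to demonstrate in Python: each non-ASCII character's multi-byte UTF-8 sequence splits into two or more Windows-1252 characters when mis-decoded (the sample word is illustrative).

```python
word = "kärlek"  # Swedish for "love"

# UTF-8 encodes "ä" as the two bytes 0xC3 0xA4; Windows-1252 reads those
# bytes as two separate characters, "Ã" and "¤"
garbled = word.encode("utf-8").decode("cp1252")
print(garbled)  # kÃ¤rlek
```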

Some users transliterate their writing when using a computer, either by omitting the problematic diacritics, or by using digraph replacements (å → aa, ä/æ → ae, ö/ø → oe, ü → ue etc.).

As an example, the Norwegian football player Ole Gunnar Solskjær had his last name spelled "SOLSKJAER" on his uniform when he played for Manchester United.
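Such digraph replacements can be sketched as a simple translation table in Python (lowercase letters only, for brevity; the table itself is illustrative, not a standard):

```python
# Common digraph fallbacks for Nordic letters when only ASCII is available
digraphs = str.maketrans({"å": "aa", "ä": "ae", "æ": "ae",
                          "ö": "oe", "ø": "oe", "ü": "ue"})

print("Solskjær".translate(digraphs))  # Solskjaer
```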

It is common to respond to a corrupted e-mail with the nonsense phrase "Árvíztűrő tükörfúrógép" (literally "Flood-resistant mirror-drilling machine") which contains all accented characters used in Hungarian.

Polish companies selling early DOS computers created their own mutually-incompatible ways to encode Polish characters and simply reprogrammed the EPROMs of the video cards (typically CGA, EGA, or Hercules) to provide hardware code pages with the needed glyphs for Polish—arbitrarily located without reference to where other computer sellers had placed them.

The situation began to improve when, after pressure from academic and user groups, ISO 8859-2 gained acceptance as the "Internet standard", albeit with limited support in the dominant vendors' software (today it has largely been replaced by Unicode).

Because of the numerous problems caused by the variety of encodings, even today some users tend to refer to Polish diacritical characters as krzaczki ([ˈkʂät͜ʂ.ki], lit. "little shrubs").

The Soviet Union and early Russian Federation developed KOI encodings (Kod Obmena Informatsiey, Код Обмена Информацией, which translates to "Code for Information Exchange").

It is for this reason that KOI8 text, even Russian, remains partially readable after stripping the eighth bit, which was considered as a major advantage in the age of 8BITMIME-unaware email systems.
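This design property can be demonstrated in Python: stripping the eighth bit from KOI8-R bytes leaves a readable Latin transliteration, with the case inverted, because KOI8 pairs each lowercase Cyrillic letter with the corresponding uppercase Latin letter (and vice versa).

```python
raw = "Библиотека".encode("koi8_r")   # "library" in Russian

# Clearing bit 7 of each byte, as a 7-bit mail gateway would
stripped = bytes(b & 0x7F for b in raw)

print(stripped.decode("ascii"))  # bIBLIOTEKA
```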

For example, attempting to view non-Unicode Cyrillic text using a font that is limited to the Latin alphabet, or using the default ("Western") encoding, typically results in text that consists almost entirely of capitalized vowels with diacritical marks (e.g. KOI8 "Библиотека" (biblioteka, library) becomes "âÉÂÌÉÏÔÅËÁ", while "Школа русского языка" (shkola russkogo yazyka, Russian-language school) becomes "ûËÏÌÁ ÒÕÓÓËÏÇÏ ÑÚÙËÁ").
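This example can be reproduced directly in Python, with Latin-1 standing in for the default "Western" encoding:

```python
# KOI8-R places Cyrillic letters in the 0xC0-0xFF range, which Latin-1
# interprets as accented Latin vowels and consonants
koi8 = "Библиотека".encode("koi8_r")
print(koi8.decode("latin-1"))  # âÉÂÌÉÏÔÅËÁ
```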

Croatian, Bosnian, Serbian (the seceding varieties of Serbo-Croatian language) and Slovenian add to the basic Latin alphabet the letters š, đ, č, ć, ž, and their capital counterparts Š, Đ, Č, Ć, Ž (only č/Č, š/Š and ž/Ž are officially used in Slovenian, although others are used when needed, mostly in foreign names).

When confined to basic ASCII (most user names, for example), common replacements are: š→s, đ→dj, č→c, ć→c, ž→z (capital forms analogously, with Đ→Dj or Đ→DJ depending on word case).
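Because đ expands to a digraph, a folding routine has to look at the word's overall case to choose between "Dj" and "DJ". A minimal Python sketch (the mapping and function name are illustrative, not from any standard library):

```python
# ASCII fallbacks for South Slavic letters; đ expands to a digraph, so
# word case decides between "Dj" (mixed case) and "DJ" (all caps)
FOLD = {"š": "s", "đ": "dj", "č": "c", "ć": "c", "ž": "z"}

def fold(word: str) -> str:
    all_caps = word.isupper()
    out = []
    for ch in word:
        rep = FOLD.get(ch.lower())
        if rep is None:
            out.append(ch)                 # no replacement needed
        elif ch.isupper():
            out.append(rep.upper() if all_caps else rep.capitalize())
        else:
            out.append(rep)
    return "".join(out)

print(fold("Đorđe"), fold("ĐORĐE"))  # Djordje DJORDJE
```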

For example, if the Swedish word kärlek is encoded in Windows-1252 but decoded using GBK, it will appear as "k鋜lek", where "är" is parsed as "鋜".
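The mis-parse happens because 0xE4 (Windows-1252 "ä") is a valid GBK lead byte, so the decoder consumes the following "r" (0x72) as its trail byte. A Python sketch:

```python
# "ä" (0xE4) begins a two-byte GBK sequence, swallowing the "r" (0x72)
# that follows it, so "är" collapses into a single CJK character
garbled = "kärlek".encode("cp1252").decode("gbk")
print(garbled)
```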

In some rare cases, an entire text string that happens to match a pattern of particular word lengths, such as the sentence "Bush hid the facts", may be misinterpreted by a charset-detection heuristic as being in a different encoding altogether.
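The effect can be sketched in Python: the 18 ASCII bytes of that sentence pair up into 9 two-byte UTF-16 code units, all of which happen to be CJK characters, so a detector can plausibly guess UTF-16.

```python
# Each pair of ASCII bytes ("Bu", "sh", " h", ...) becomes one UTF-16
# little-endian code unit, producing a string of 9 CJK characters
mis = "Bush hid the facts".encode("ascii").decode("utf-16-le")
print(mis)
```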

In Vietnamese, the phenomenon is called chữ ma (Hán–Nôm: 𡨸魔, "ghost characters") or loạn mã (from Chinese 乱码, luànmǎ).

In Vietnam, chữ ma was commonly seen on computers that ran pre-Vista versions of Windows or cheap mobile phones.

In Mac OS and iOS, the muurdhaja l (dark l) and 'u' combination and its long form both yield wrong shapes.

Some Indic and Indic-derived scripts, most notably Lao, were not officially supported in Windows XP; support arrived only with the release of Windows Vista.

Due to these ad hoc encodings, communications between users of Zawgyi and Unicode would render as garbled text.

Various other writing systems native to West Africa present similar problems, such as the N'Ko alphabet, used for Manding languages in Guinea, and the Vai syllabary, used in Liberia.

The UTF-8-encoded Japanese Wikipedia article for Mojibake displayed as if interpreted as Windows-1252
The UTF-8-encoded Russian Wikipedia article on Church Slavonic displayed as if interpreted as KOI8-R