Unicode and HTML

Web pages authored using HyperText Markup Language (HTML) may contain multilingual text represented with the Unicode universal character set.

The relationship between Unicode and HTML tends to be a difficult topic for many computer professionals, document authors, and web users alike.

The accurate representation of text in web pages from different natural languages and writing systems is complicated by the details of character encoding, markup language syntax, fonts, and varying levels of support by web browsers.

Both types of documents consist, at a fundamental level, of characters, which are graphemes and grapheme-like units, independent of how they manifest in computer storage systems and networks.
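As a small Python illustration of characters being independent of their storage, the same user-perceived character can be spelled as a single code point or as a base letter plus a combining mark:

```python
import unicodedata

# The user-perceived character "é" can be stored as one code point
# (precomposed U+00E9) or as two (U+0065 "e" + U+0301 combining acute).
precomposed = "\u00e9"
decomposed = "e\u0301"

# As sequences of code points the two spellings differ...
assert len(precomposed) == 1 and len(decomposed) == 2
# ...but Unicode normalization (NFC) shows they are the same character.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```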

It is also possible to use UTF-16, in which most characters are stored as two bytes with varying endianness; UTF-16 is supported by modern browsers but less commonly used.
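The two byte orders can be seen directly in Python, using 合 (U+5408) as an example character:

```python
# "合" (U+5408) in UTF-16: the same 16-bit code unit, two byte orders.
ch = "\u5408"
assert ch.encode("utf-16-be") == b"\x54\x08"  # big-endian
assert ch.encode("utf-16-le") == b"\x08\x54"  # little-endian
# The plain "utf-16" codec prepends a byte-order mark (BOM) so that
# a reader can tell which order was used.
assert ch.encode("utf-16").startswith((b"\xff\xfe", b"\xfe\xff"))
```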

To ensure better compatibility with older browsers, it is still a common practice to convert the hexadecimal code point into a decimal value (for example &#21512; instead of &#x5408;, both of which refer to 合).
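The conversion is ordinary base-16 to base-10 arithmetic; a minimal Python sketch (the helper name `decimal_ncr` is illustrative, not a standard API):

```python
# The hexadecimal reference &#x5408; and the decimal form &#21512;
# name the same code point, because 0x5408 == 21512.
assert int("5408", 16) == 21512
assert chr(0x5408) == chr(21512) == "合"

def decimal_ncr(char):
    """Build the decimal numeric character reference for a character."""
    return f"&#{ord(char)};"

assert decimal_ncr("合") == "&#21512;"
```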

Although any Unicode character can be referenced by its numeric code point, some HTML document authors prefer to use these named entities instead, where possible, as they are less cryptic and were better supported by early browsers.
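Python's standard `html` module can be used to confirm that a named entity and the corresponding numeric references all resolve to the same character:

```python
import html

# The named entity &eacute; and the numeric references &#233; and &#xE9;
# all denote the same character, U+00E9.
assert html.unescape("&eacute;") == "\u00e9"
assert html.unescape("&#233;") == html.unescape("&#xE9;") == "\u00e9"
```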

For a browser from a location where legacy multi-byte character encodings are prevalent, some form of auto-detection is likely to be applied.
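Real browsers apply statistical heuristics to the byte stream; purely as a toy sketch, one common trick is to accept UTF-8 only if the bytes decode cleanly and otherwise fall back to an assumed legacy encoding (Windows-1252 here, chosen arbitrarily for illustration):

```python
def guess_encoding(data):
    """Toy auto-detection: UTF-8 if the bytes decode cleanly, else a
    legacy single-byte fallback (real detectors are far more elaborate)."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"

# Well-formed UTF-8 is accepted; a lone 0xE9 byte (legacy "é") is not.
assert guess_encoding("合".encode("utf-8")) == "utf-8"
assert guess_encoding("é".encode("windows-1252")) == "windows-1252"
```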

Because of the legacy of 8-bit text representations in programming languages and operating systems, and the desire to avoid burdening users with the nuances of encoding, many text editors used by HTML authors offer no choice of encodings when saving files to disk, and many do not even allow input of characters beyond a very limited range.

Another factor contributing in the same direction is the arrival of UTF-8, which greatly diminishes the need for other encodings; modern editors therefore tend to default to UTF-8, as recommended by the HTML5 specification.[1]
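One reason UTF-8 can displace other encodings is that it leaves ASCII unchanged, one byte per character, so existing ASCII-only files are already valid UTF-8; multi-byte sequences appear only where needed. In Python:

```python
# ASCII text is byte-for-byte identical in UTF-8...
assert "HTML".encode("utf-8") == b"HTML"
# ...while characters outside ASCII get multi-byte sequences.
assert "合".encode("utf-8") == b"\xe5\x90\x88"  # U+5408 as three bytes
```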

No additional metadata mechanisms are required for these encodings since the byte-order mark includes all of the information necessary for processing applications.
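A processing application can recover the encoding from the leading bytes alone; a minimal sketch covering the common BOMs (not an exhaustive detector; UTF-32, for instance, is omitted):

```python
import codecs

def encoding_from_bom(data):
    """Return the encoding named by a leading byte-order mark, if any."""
    for bom, name in [
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return name
    return None

# Python's "utf-16" codec prepends a BOM, so its output is self-describing.
assert encoding_from_bom("合".encode("utf-16")) in ("utf-16-le", "utf-16-be")
assert encoding_from_bom(b"plain ascii") is None
```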

In most circumstances, the byte-order mark character is handled by editing applications separately from the other characters, so there is little risk of an author removing or otherwise changing the byte-order mark to indicate the wrong encoding (as can happen when the encoding is declared in English/Latin script).

The fact that the manual override is present and widely used hinders the adoption of accurate encoding declarations on the Web; therefore the problem is likely to persist.

Note, however, that Internet Explorer, Chrome and Safari – for both the XML and text/html serializations – do not permit the encoding to be overridden when the page includes the BOM.[2]

For HTML documents serialized with the preferred XML label – application/xhtml+xml – manual encoding override is not permitted.

Modern browsers will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system.