Character encodings in HTML

When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: preserving the integrity of the information and displaying it correctly in every browser.

In Chinese, Japanese, and Korean (CJK) language environments, where several different multi-byte encodings are in use, auto-detection is also often employed.
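As a rough illustration of the idea, the sketch below guesses an encoding by trial decoding. It is a deliberately naive stand-in, not how browsers implement detection: real detectors typically rely on statistical analysis of the byte stream. The function name guess_encoding and the candidate list (and its order) are assumptions made for this example.

```python
# Hypothetical candidate list; the order is an assumption and it matters,
# because the same bytes can be valid in more than one encoding.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def guess_encoding(data: bytes, candidates=CANDIDATES):
    """Return the first candidate that decodes `data` without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


sjis_bytes = "日本語".encode("shift_jis")
utf8_bytes = "日本語".encode("utf-8")
print(guess_encoding(sjis_bytes))
print(guess_encoding(utf8_bytes))
```

Because a successful decode does not prove the guess is right, a sketch like this is best treated as a fallback when no charset label is available.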

Finally, browsers usually also permit the user to override an incorrect charset label manually.

UTF-16 and UTF-32, which can also be used for all languages, are less widely used for two reasons: they can be harder to handle in programming languages that assume a byte-oriented ASCII superset encoding, and they are less efficient for text with a high frequency of ASCII characters, which is usually the case for HTML documents.
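The efficiency point can be checked with a small sketch that encodes a markup-heavy string three ways and compares the byte counts. The sample string is made up for illustration, and the "-le" codec variants are used so that no byte order mark is added to the totals.

```python
# A made-up, mostly-ASCII HTML fragment with one non-ASCII character.
sample = '<p class="note">Caf\u00e9</p>'

# Byte length of the same text in each encoding (little-endian, no BOM).
sizes = {
    enc: len(sample.encode(enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

for enc, size in sizes.items():
    print(f"{enc}: {size} bytes for {len(sample)} characters")
```

For text like this, UTF-8 spends one byte on each ASCII character, while UTF-16 spends two and UTF-32 spends four.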

This is intended to prevent attacks (e.g. cross-site scripting) that exploit a difference between the encodings supported by the client and by the server in order to mask malicious content.

[29] Although the same security concern applies to ISO-2022-JP and UTF-16, which also allow sequences of ASCII bytes to be interpreted differently, this approach was not seen as feasible for them since they are comparatively more frequently used in deployed content.
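To make the concern concrete, the following sketch decodes the same eight bytes first as ASCII and then as UTF-16. It only demonstrates that the interpretation changes with the assumed encoding; it is not a description of any specific attack, and the byte string is chosen purely for illustration.

```python
# Eight bytes that are an opening script tag when read as ASCII.
payload = b"<script>"

as_ascii = payload.decode("ascii")
# Read as little-endian UTF-16, each pair of bytes becomes one code unit,
# so no "<" or ">" character appears in the decoded text.
as_utf16 = payload.decode("utf-16-le")

print(repr(as_ascii))
print(len(as_utf16), "characters, contains '<':", "<" in as_utf16)
```

A filter that inspects the text under one encoding while the browser renders it under the other would therefore see different content.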

The character entity references &lt;, &gt;, &quot; and &amp; are predefined in HTML and SGML, because <, >, " and & are already used to delimit markup.

Incorrect HTML entity escaping may also open up security vulnerabilities for injection attacks such as cross-site scripting.
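A minimal sketch of escaping untrusted text before placing it in markup, using the html.escape function from the Python standard library. The render_comment helper and the sample input are hypothetical and exist only for this example.

```python
import html


def render_comment(user_text: str) -> str:
    """Wrap untrusted text in a paragraph, escaping markup-significant characters."""
    # html.escape replaces &, < and >, and (by default) both quote characters.
    return "<p>" + html.escape(user_text) + "</p>"


untrusted = '<script>alert("x")</script>'
print(render_comment(untrusted))
```

With the escaping in place, the input is shown as text rather than being parsed as a script element.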

If HTML attributes are left unquoted, certain characters, most importantly whitespace, such as space and tab, must be escaped using entities.
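The sketch below shows one way such escaping could look, replacing each problematic character with a numeric character reference. The helper name escape_unquoted_attr is hypothetical, and the exact character set (ASCII whitespace plus the quote characters, =, <, >, the backtick and &) is an assumption based on the HTML syntax rules for unquoted attribute values. In practice, quoting the attribute is usually the simpler choice.

```python
import re

# Assumed set of characters that must not appear literally in an unquoted
# attribute value: ASCII whitespace, both quotes, =, <, >, backtick, and &.
_UNSAFE = re.compile(r"[\t\n\f\r \"'=<>`&]")


def escape_unquoted_attr(value: str) -> str:
    """Replace each unsafe character with a decimal character reference."""
    return _UNSAFE.sub(lambda m: f"&#{ord(m.group())};", value)


escaped = escape_unquoted_attr("two words\there")
print(f"<input value={escaped}>")
```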

For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined.
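This behaviour can be observed with the XML parser in the Python standard library. The sketch below feeds it one document that uses the named reference &eacute; without defining it and one that uses the numeric reference &#233; instead. The parses_as_xml helper is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(document: str) -> bool:
    """Return True if the string is accepted by the XML parser."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


named = "<p>caf&eacute;</p>"   # named entity, not defined in this document
numeric = "<p>caf&#233;</p>"   # numeric reference to U+00E9, always available

print(parses_as_xml(named))
print(parses_as_xml(numeric))
```

Numeric character references need no declaration, which is why they are a common substitute for HTML-only named entities in XML.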