ISO/IEC 2022

ISO/IEC 2022 Information technology—Character code structure and extension techniques, is an ISO/IEC standard in the field of character encoding.

[4] ISO 2022 specifies a general structure which character encodings can conform to, dedicating particular ranges of bytes (0x00–1F and 0x7F–9F) to be used for non-printing control codes[5] for formatting and in-band instructions (such as line breaks or formatting instructions for text terminals), rather than graphical characters.

[6] Specific sets of control codes and escape sequences designed to be used with ISO 2022 include ISO/IEC 6429, portions of which are implemented by ANSI.SYS and terminal emulators.

ISO 2022 itself also defines particular control codes and escape sequences which can be used for switching between different coded character sets (for example, between ASCII and the Japanese JIS X 0208) so as to use multiple in a single document,[7] effectively combining them into a single stateful encoding (a feature less important since the advent of Unicode).

Other writing systems with relatively few characters, such as Greek, Cyrillic, Arabic or Hebrew, as well as forms of the Latin script using diacritics or letters absent from the ISO Basic Latin alphabet, have historically been represented on personal computers with different 8-bit, single byte, extended ASCII encodings, which follow ASCII when the most significant bit is 0 (i.e. bytes 0x00–7F, when represented in hexadecimal), and include additional characters for a most significant bit of 1 (i.e. bytes 0x80–FF).

Although many of the mechanisms defined by the ISO/IEC 2022 standard are infrequently used, several established encodings are based on a subset of the ISO/IEC 2022 system.

[11] More specialised applications of ISO 2022 include the MARC-8 encoding system used in MARC 21 library records.

Specific profiles such as ISO-2022-JP may impose extra conditions, such as that the current character set is reset to US-ASCII before the end of a line.

Encoding byte values ("bit combinations") are often given in column-line notation, where two decimal numbers in the range 00–15 (each corresponding to a single hexadecimal digit) are separated by a slash.

[30] The first I byte, or absence thereof, determines the type of escape sequence; it might, for instance, designate a working set, or denote a single control function.

In all types of escape sequences, F bytes in the range 0x30–0x3F are reserved for unregistered private uses defined by prior agreement between parties.

[38] Use of the backspace and carriage return in this manner is permitted by ISO/IEC 646 but prohibited by ISO/IEC 4873 / ECMA-43[39] and by ISO/IEC 8859,[40][41] on the basis that it leaves the graphical character repertoire undefined.

[45] Inclusion of transmission control characters in the C0 set, besides the ten included by ISO 6429 / ECMA-48 (namely SOH, STX, ETX, EOT, ENQ, ACK, DLE, NAK, SYN and ETB),[47] or inclusion of any of those ten in the C1 set, is also prohibited by the ISO/IEC 2022 / ECMA-35 standard.

The set invoked over each area may also be modified with control codes referred to as shifts, as shown in the table below.

[81] The textual representation of the escape sequence, included in the third element of the FPI, will be recognised by SGML implementations for supported character sets.

[86] The manner in which DRCS sets and associated fonts are transmitted, allocated and managed is not stipulated by ISO/IEC 2022 / ECMA-35 itself, although it recommends allocating them sequentially starting with F byte 0x40 (@);[87] however, a manner for transmitting DRCS fonts is defined within some telecommunication protocols such as World System Teletext.

[100] All existing registrations of the latter type (as of 2019) are either transparent raw data, Unicode/UCS formats, or subsets thereof.

[107] For specifying encodings by labels, the X Consortium's Compound Text format defines five private-use DOCS sequences.

Announcement sequences are as follows: Six 7-bit ISO 2022 code versions (ISO-2022-CN, ISO-2022-CN-EXT, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2 and ISO-2022-KR) are defined by IETF RFCs, of which ISO-2022-JP and ISO-2022-KR have been extensively used in the past.

[110] Although UTF-8 is the preferred encoding in HTML5, legacy content in ISO-2022-JP remains sufficiently widespread that the WHATWG encoding standard retains support for it,[111] in contrast to mapping ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT[112] entirely to the replacement character,[113] due to concerns about code injection attacks such as cross-site scripting.

[114][i] Use of ESC ( I to switch to the JIS X 0201-1976 Kana set (1 byte per character) is not part of the ISO-2022-JP profile,[114] but is also sometimes used.

Python allows it in a variant which it labels ISO-2022-JP-EXT (which also incorporates JIS X 0212 as described below, completing coverage of EUC-JP);[117][118] this is close in both name and structure to an encoding denoted ISO-2022-JPext by DEC, which furthermore adds a two-byte user-defined region accessed with ESC $ ( 0 to complete the coverage of Super DEC Kanji.

[115] ISO-2022-JP-2 is a multilingual extension of ISO-2022-JP, defined in RFC 1554 (dated 1993), which permits the following escape sequences in addition to the ISO-2022-JP ones.

The ISO/IEC 8859 parts are 96-character sets which cannot be designated to G0, and are accessed from G2 using the 7-bit escape sequence form of the single-shift code SS2:[121] ISO-2022-JP with the ISO-2022-JP-2 representation of JIS X 0212, but not the other extensions, was subsequently dubbed ISO-2022-JP-1 by RFC 2237, dated 1997.

Notably, the WHATWG Encoding Standard used by HTML5 maps ISO-2022-KR, ISO-2022-CN and ISO-2022-CN-EXT (as well as HZ-GB-2312) to the "replacement" decoder,[112] which maps all input to the replacement character (�), in order to prevent certain cross-site scripting and related attacks, which utilize a difference in encoding support between the client and server.

[113] Although the same security concern (allowing sequences of ASCII bytes to be interpreted differently) also applies to ISO-2022-JP and UTF-16, they could not be given this treatment due to being much more frequently used in deployed content.

[111] In April 2024, a security flaw[137] was found in the implementation of ISO-2022-CN-EXT in glibc, which lead to recommendations to disable the encoding entirely on Linux systems.

[138] A subset of ISO 2022 applied to 8-bit single-byte encodings is defined by ISO/IEC 4873, also published by Ecma International as ECMA-43.

Each ISO/IEC 4873 level has its own single ISO/IEC 2022 announcer sequence, which are as follows:[147] Extended Unix Code (EUC) is an 8-bit variable-width character encoding system used primarily for Japanese, Korean, and simplified Chinese.

[148] The X Consortium defined an ISO 2022 profile named Compound Text as an interchange format in 1989.