GB 18030

GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC), superseding GB2312.

[7]: 534 More code points are now associated with characters because of updates to Unicode, especially the addition of CJK Unified Ideographs Extension B.

The remaining six kept the two-byte PUA mappings, so that a change to the 4-byte sequence is needed to follow the non-PUA preference.

[7]: 3 This version also includes the full CJK Unified Ideographs Extension B, which lies outside the BMP,[10] in the 4-byte encoding section as a suggested (not mandatory) support requirement.

[14] However, as the inclusion of CJK Unified Ideographs Extension B in a 4-byte region is required to be maintained during information processing, software can no longer get away with treating characters as 16-bit fixed width entities (UCS-2).
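The point about 16-bit units can be illustrated with a minimal sketch that encodes one Extension B ideograph using Python's built-in `gb18030` and `utf-16-be` codecs. The choice of U+20000 (the first code point of Extension B) as the sample character is an assumption made for illustration.

```python
# U+20000 is the first code point of CJK Unified Ideographs Extension B.
ch = "\U00020000"

gb = ch.encode("gb18030")       # GB 18030 form
utf16 = ch.encode("utf-16-be")  # UTF-16 form, without a byte-order mark

# The character lies outside the BMP, so it does not fit in one 16-bit unit.
print(len(gb), gb.hex())        # length and bytes of the GB 18030 form
print(len(utf16), utf16.hex())  # two 16-bit units (a surrogate pair) in UTF-16
```

Code that assumes one 16-bit unit per character would split such a character in half, which is the failure mode the paragraph above describes.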

This version matches Unicode 3.1 and also provides support for Hangul (Korean), Mongolian (including Manchu, Clear script, Sibe hergen, Galik), Tai Nuea, Tibetan, Uyghur/Kazakh/Kyrgyz and Yi.

The third and latest version, GB 18030-2022 Information Technology—Chinese coded character set, makes mandatory the support for CJK Unified Ideographs Extension B that GB 18030-2005 only suggested, and adds updates up to Unicode 11.0, including Kangxi Radicals and CJK Unified Ideographs URO and Extensions C, D, E and F. Additional languages, such as part of Arabic, Tai Le, New Tai Lue, Tai Tham, Lisu, and Miao, are also recognized by GB 18030-2022.

[15][16][17] Originally, in late 2022, it would have placed 897 new sinographic characters in Plane 10 (hexadecimal: 0A), a yet-untitled astral Unicode plane, for citizen real-name certification in China, but eventually the repertoire (reduced to 622 characters after expert review) was fast-tracked into Unicode 15.1 in September 2023, as the CJK Unified Ideographs Extension I block.

GB 18030 inherits the drawbacks of GBK, most notably the need for special code to find ASCII characters safely in a GB18030 sequence.
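The problem arises because trail bytes of multi-byte sequences can fall in the ASCII range, so a plain byte search can match in the middle of a character. The following is a minimal sketch of a character-aware search; the function name `find_ascii` is hypothetical, and the sketch assumes well-formed input rather than validating it.

```python
def find_ascii(data: bytes, target: int) -> int:
    """Return the index of the ASCII byte `target` in GB 18030 data, or -1.

    Walks the data one encoded character at a time so that trail bytes of
    multi-byte sequences are never mistaken for ASCII characters.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if 0x81 <= b <= 0xFE and i + 1 < n:
            # Lead byte: a second byte in 0x30-0x39 marks a four-byte form,
            # anything else is treated as the trail byte of a two-byte form.
            i += 4 if 0x30 <= data[i + 1] <= 0x39 else 2
            continue
        if b == target:
            return i
        i += 1
    return -1


# 0x81 0x5C is a two-byte character whose trail byte equals ASCII backslash.
sample = b"\x81\x5c" + b"\\"
print(sample.find(b"\\"))        # naive byte search: hits inside the character
print(find_ascii(sample, 0x5C))  # character-aware search: the real backslash
```

The same hazard applies to ASCII digits, which appear as the second and fourth bytes of four-byte sequences.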

This gives a total of 1,587,600 (126×10×126×10) possible 4-byte sequences, which is easily sufficient to cover Unicode's 1,112,064 (17×65536 − 2048 surrogates) assigned, reserved, and noncharacter code points.
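The arithmetic above can be checked with a short sketch, which also shows how a four-byte sequence maps to a linear index and back. The helper names `index_to_bytes` and `bytes_to_index` are hypothetical, and index 0 is assumed to correspond to the lowest sequence, 81 30 81 30.

```python
# Byte ranges of a GB 18030 four-byte sequence.
LEAD = range(0x81, 0xFF)   # bytes 1 and 3: 126 possible values
DIGIT = range(0x30, 0x3A)  # bytes 2 and 4: 10 possible values

four_byte_total = len(LEAD) * len(DIGIT) * len(LEAD) * len(DIGIT)
unicode_scalars = 17 * 65536 - 2048  # all 17 planes minus the surrogates


def index_to_bytes(index: int) -> bytes:
    """Convert a linear index (0 = 81 30 81 30) to its four-byte sequence."""
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([0x81 + b1, 0x30 + b2, 0x81 + b3, 0x30 + b4])


def bytes_to_index(seq: bytes) -> int:
    """Inverse of index_to_bytes for a well-formed four-byte sequence."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


print(four_byte_total, unicode_scalars)
print(index_to_bytes(four_byte_total - 1).hex())  # the highest sequence
```

Treating the four bytes as mixed-radix digits (126, 10, 126, 10) is what makes large linear ranges of code points cheap to map.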

[h] For example, the WHATWG and W3C version of GB 18030 uses an offset table to translate code points efficiently.

[20] ICU[19] and glibc use similar range definitions to avoid wasting space on large sequential blocks.
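The range-based approach can be sketched as follows: instead of one entry per code point, the table stores only the start of each run of consecutive mappings, and a lookup adds the distance from that start. This is a minimal sketch, not the WHATWG, ICU, or glibc implementation. The table below is a three-entry excerpt chosen for illustration (a real table is much longer), and the names `RANGES`, `EXCERPT_END`, and `pointer_to_code_point` are hypothetical.

```python
from bisect import bisect_right

# Excerpt only: (pointer, first code point) pairs for the start of each run.
# A real table has many more entries; these three are illustrative.
RANGES = [(0, 0x0080), (36, 0x00A5), (38, 0x00A9)]
EXCERPT_END = 45                   # first pointer not covered by the excerpt
SUPPLEMENTARY = (189000, 0x10000)  # start of the linear supplementary range
MAX_POINTER = 189000 + 0xFFFFF     # pointer that maps to U+10FFFF


def pointer_to_code_point(pointer: int) -> int:
    """Map a four-byte linear pointer to a code point via range offsets."""
    if SUPPLEMENTARY[0] <= pointer <= MAX_POINTER:
        return SUPPLEMENTARY[1] + (pointer - SUPPLEMENTARY[0])
    if not 0 <= pointer < EXCERPT_END:
        raise LookupError("pointer outside this excerpt")
    # Find the last range whose starting pointer is <= pointer.
    starts = [start for start, _ in RANGES]
    start, first = RANGES[bisect_right(starts, pointer) - 1]
    return first + (pointer - start)


print(hex(pointer_to_code_point(36)))      # start of the second run
print(hex(pointer_to_code_point(189000)))  # start of the supplementary planes
```

Because each run is stored as a single pair, the whole of planes 1 to 16 costs one table entry rather than more than a million.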

Loading will fail or produce a corrupted result if the file contains characters that do not exist in GBK (see § Technical details for examples).

[25] GNU libiconv, an alternative iconv implementation frequently used on non-glibc UNIX-like environments like Cygwin, has supported GB 18030 since version 1.4.

GB 18030 compliance certification only requires correct handling and recognition of glyphs in the mandatory (two-byte, and CJK Ext.