KS X 1001

KS X 1001, "Code for Information Interchange (Hangul and Hanja)",[d][1] formerly called KS C 5601, is a South Korean coded character set standard to represent Hangul and Hanja characters on a computer.

It contains Korean Hangul syllables, CJK ideographs (Hanja), Greek, Cyrillic, Japanese (Hiragana and Katakana) and some other characters.

KS X 1001 is arranged as a 94×94 table, following the structure of 2-byte code words in ISO 2022 and EUC.
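As a rough illustration of the 94×94 layout, the sketch below turns a 1-based (row, cell) pair into the two-byte forms used by ISO 2022 (7-bit, offset 0x20) and by EUC (the same bytes with the high bit set). The function names are hypothetical, and the final decoding step assumes Python's built-in `euc_kr` codec.

```python
def kuten_to_iso2022(row: int, cell: int) -> bytes:
    """Return the 7-bit two-byte code for a 1-based row and cell."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    # Each coordinate is offset by 0x20, giving bytes in 0x21..0x7E.
    return bytes([0x20 + row, 0x20 + cell])


def kuten_to_euc(row: int, cell: int) -> bytes:
    """Return the EUC form: the ISO 2022 bytes with the high bit set."""
    return bytes(b | 0x80 for b in kuten_to_iso2022(row, cell))


# Row 16, cell 1 is the first precomposed Hangul syllable.
print(kuten_to_iso2022(16, 1).hex())         # 3021
print(kuten_to_euc(16, 1).hex())             # b0a1
print(kuten_to_euc(16, 1).decode("euc_kr"))  # 가
```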

[3] It is an ISO 2022 compatible encoding, typically used in EUC form, which assigns double-byte codes for non-Hangul, Hangul jamo, and the most common Hangul syllables, in contrast to Johab (조합; Johap; lit. "combining"),[1] which is not compatible with ISO 2022 but assigns double-byte codes to all Hangul syllables using modern jamo.

[2] Wansung is technically a variable-length encoding, allowing other syllables to be represented with eight-byte sequences (using the jamo and Hangul Filler character), but this feature is not always implemented.
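The eight-byte form can be sketched as follows, assuming the usual arrangement in which the Hangul Filler is followed by the initial, vowel and final jamo, with a second filler standing in for a missing final. The function name and the jamo tables are illustrative rather than taken from the standard, and each two-byte code is obtained from Python's `euc_kr` codec.

```python
# Compatibility jamo in the order used to index modern Hangul syllables.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
FINALS = ["", *"ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"]
FILLER = "\u3164"  # HANGUL FILLER


def make_up_sequence(syllable: str) -> bytes:
    """Spell one modern Hangul syllable as filler + three jamo (8 bytes)."""
    index = ord(syllable) - 0xAC00
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError("expected one modern Hangul syllable")
    initial, rest = divmod(index, 21 * 28)
    vowel, final = divmod(rest, 28)
    jamo = [
        INITIALS[initial],
        chr(0x314F + vowel),       # the 21 vowels are contiguous from U+314F
        FINALS[final] or FILLER,   # an absent final is written as a filler
    ]
    # Every character here has its own two-byte code in row 4.
    return b"".join(ch.encode("euc_kr") for ch in [FILLER, *jamo])


sequence = make_up_sequence("뢔")
print(len(sequence), sequence.hex())
```

A decoder that does not implement this feature would see four separate characters (a filler and three jamo) instead of one syllable.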

[4] The earliest edition of KS C 5601, published in 1974,[2] defined a variable-length[2] 7-bit character set which assigned single-byte code points to 51[3] basic Hangul jamo, somewhat analogously to JIS C 6220, in an encoding known as "N-byte Hangul".

[1][5] It was published in response to industry use of Johab, which Hangul Word Processor used at the time, as a competing encoding to Wansung.

These all share the drawback that they assign codes only to the 2,350 precomposed Hangul syllables which have their own KS X 1001 code points (out of 11,172 in total, not counting those using obsolete jamo); other syllables require eight-byte composition sequences, which some partial implementations of the standard do not support.
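The two figures can be checked with a short script. This is a sketch that assumes Python's `euc_kr` codec, which gives exactly two bytes to a syllable that has its own code point and otherwise either emits a longer sequence or raises an error; `has_own_code_point` is a hypothetical helper name.

```python
def has_own_code_point(ch: str) -> bool:
    """True if the character encodes to a single two-byte KS X 1001 code."""
    try:
        return len(ch.encode("euc_kr")) == 2
    except UnicodeEncodeError:
        return False


# All modern precomposed syllables: 19 initials x 21 vowels x 28 finals.
modern = [chr(cp) for cp in range(0xAC00, 0xD7A4)]
covered = [ch for ch in modern if has_own_code_point(ch)]

print(len(modern), len(covered))  # 11172 2350
```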

Some operating systems extend this standard in other non-uniform ways, e.g. the EUC-KR extensions MacKorean on the classic Mac OS, and IBM-949 by IBM.

To illustrate vendor differences in implementation, multiple Unicode mappings are shown for some characters.

Encodings which combine KS X 1001 with single-byte ASCII may map the backslash to an alternative Unicode code point in the Halfwidth and Fullwidth Forms block.

Contrast the third rows of KPS 9566 and of JIS X 0208, which follow the ISO 646 layout but only include letters and digits.

This set contains Roman numerals and basic support for the Greek alphabet, without diacritics or the final sigma.

[16] This row contains unit symbols as single characters, including those which consist of multiple letters.

This set contains the modern Russian alphabet, and is not necessarily sufficient to represent other forms of the Cyrillic script.

Initial+vowel+final syllables 뢨, 썅, 쏀, 쓩, and 쭁 are included but their initial+vowel counterparts 뢔, 쌰, 쎼, 쓔, and 쬬 are not.

This represents a Hangul syllable as a sequence of three five-bit values, split across two 8-bit bytes, most significant bit first.

The most significant bit of the lead byte is always set (allowing combination with single-byte ASCII or KS X 1003).
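The bit layout described above can be sketched as a pair of helpers. The names `pack_johab` and `unpack_johab` are hypothetical, the five-bit values are treated as opaque numbers (the tables mapping them to jamo are not reproduced here), and the usage line assumes Python's built-in `johab` codec.

```python
def pack_johab(initial: int, vowel: int, final: int) -> bytes:
    """Pack three five-bit values into two bytes with the top bit set."""
    for value in (initial, vowel, final):
        if not 0 <= value < 32:
            raise ValueError("each field must fit in five bits")
    word = 0x8000 | (initial << 10) | (vowel << 5) | final
    return word.to_bytes(2, "big")


def unpack_johab(pair: bytes) -> tuple:
    """Split a two-byte Hangul code into (initial, vowel, final) fields."""
    if len(pair) != 2 or not pair[0] & 0x80:
        raise ValueError("expected two bytes with the lead byte's top bit set")
    word = int.from_bytes(pair, "big")
    return (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F


encoded = "가".encode("johab")
print(encoded.hex(), unpack_johab(encoded))  # 8861 (2, 3, 1)
```

Because the set top bit keeps the lead byte at 0x80 or above, a decoder can tell it apart from a single-byte ASCII character.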

[24] Other, vendor-defined, Johab variants also exist; for example, IBM defines one for use as a Shift Out set with EBCDIC.

That variant uses shift in and shift out to switch between a single-byte EBCDIC page and Johab, uses a different encoding for the non-Hangul characters (using lead bytes 0x40–6C with a different layout), and uses lead bytes 0xD4–DD as a user-defined region, but uses the same Johab layout as the 1992 standard for the Hangul characters when in shift-out state.
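The shift-state switching can be sketched as a small splitter. This is only an illustration: it assumes the conventional control codes 0x0E for shift out and 0x0F for shift in, treats those two bytes as controls wherever they appear, and does not decode either the single-byte page or the double-byte codes. The function name is hypothetical.

```python
SHIFT_OUT = 0x0E  # switch to the double-byte state
SHIFT_IN = 0x0F   # return to the single-byte state


def split_shift_states(data: bytes) -> list:
    """Split a byte stream into ("single" | "double", bytes) segments."""
    segments = []
    state = "single"
    run = bytearray()

    def flush():
        # Record the bytes collected so far under the current state.
        if run:
            segments.append((state, bytes(run)))
            run.clear()

    for byte in data:
        if byte in (SHIFT_OUT, SHIFT_IN):
            flush()
            state = "double" if byte == SHIFT_OUT else "single"
        else:
            run.append(byte)
    flush()
    return segments


stream = b"\xc1\x0e\x88\x61\x0f\xc2"
print(split_shift_states(stream))
```

Each "double" segment would then be read two bytes at a time, while each "single" segment is interpreted with the EBCDIC code page in use.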

(A screenshot of an old version of Firefox showing Big5, GB2312, GBK, GB18030, HZ, ISO-2022-CN, Big5-HKSCS, EUC-TW, EUC-JP, ISO-2022-JP, Shift_JIS, EUC-KR, UHC, Johab and ISO-2022-KR as available encodings under the CJK sub-menu.)
Various CJK encodings, including four based on KS X 1001, supported by Mozilla Firefox as of 2004. (This support has been reduced in later versions to avoid certain cross-site scripting attacks.)
Diagram of Johab encoding as stipulated by KS X 1001
  • 한글 : Hangul
  • 한자 : Hanja
  • 특수문자 : special characters (non-Hangul and non-Hanja characters)
Layout of EBCDIC-based Johab variant when in double-byte state