Chinese character IT

When an ASCII input code is typed on an English keyboard, the software searches its code table for the matching Chinese characters.

To make the input method easy to learn, encoding must be based on distinctive features in forms, sounds or meanings of Chinese characters.

Because the meanings of characters tend to be more abstract and complicated, input encoding is normally based on the sound or form.

[5] Sound-based encoding is normally based on an existing Latin character scheme for Chinese phonetics, such as pinyin for Putonghua, and Jyutping for Cantonese.

The input code of a Chinese character is its pinyin letter string followed by an optional number representing the tone.
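A minimal sketch of such a lookup is given here, assuming a tiny hand-made code table. The names `CODE_TABLE`, `parse_code` and `lookup` are hypothetical and are not taken from any real input method; the point is only to show a pinyin letter string with an optional tone digit being matched against a table.

```python
import re

# Hypothetical toy code table: toneless pinyin -> {tone number: candidates}.
# A real input method would load a far larger table from a dictionary file.
CODE_TABLE = {
    "ma": {1: ["妈"], 2: ["麻"], 3: ["马"], 4: ["骂"]},
    "xiang": {1: ["香"], 3: ["想"]},
}


def parse_code(code):
    """Split an input code into (letters, tone); tone is None when omitted."""
    match = re.fullmatch(r"([a-z]+)([1-5]?)", code)
    if match is None:
        raise ValueError(f"not a valid input code: {code!r}")
    letters, tone = match.groups()
    return letters, (int(tone) if tone else None)


def lookup(code):
    """Return the candidate characters for an input code, in tone order."""
    letters, tone = parse_code(code)
    by_tone = CODE_TABLE.get(letters, {})
    if tone is not None:
        return list(by_tone.get(tone, []))
    # No tone given: every character with these letters is a candidate.
    return [char for t in sorted(by_tone) for char in by_tone[t]]
```

With this table, `lookup("ma3")` narrows the result to a single character, while `lookup("ma")` returns all four candidates and leaves the choice to the user.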

[9] Strokes are usually classified into five categories, heng (一), shu (丨), pie (丿), dian (丶) and zhe (𠃍), for dictionary lookup and Chinese input on mobile phones.

For Chinese input with an ASCII keyboard, two strokes from these five categories can be combined into 5 × 5 = 25 ordered pairs, each of which is mapped to an English letter.

For example, in the ZYQ input method,[10] the stroke pairs '一一, 一丨, 一丿, ..., 𠃍丿, 𠃍丶, 𠃍𠃍' are represented by the letters 'a, b, c, ..., w, x, y' respectively.
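The pair-to-letter mapping described above can be sketched as follows. The code only reproduces the ordering given in the text (five stroke categories, pairs enumerated in order, letters assigned from 'a'); the name `PAIR_TO_LETTER` is an assumption, and nothing here claims to reflect how ZYQ is actually implemented.

```python
from itertools import product
from string import ascii_lowercase

# The five stroke categories, in the order used in the text.
STROKES = ["一", "丨", "丿", "丶", "𠃍"]

# Enumerate the 5 * 5 = 25 ordered pairs and assign letters from 'a' onwards.
# zip() stops after the 25th pair, so the 26th letter 'z' is left unused.
PAIR_TO_LETTER = {
    first + second: letter
    for (first, second), letter in zip(product(STROKES, repeat=2), ascii_lowercase)
}
```

For instance, `PAIR_TO_LETTER["一丿"]` is `'c'` and `PAIR_TO_LETTER["𠃍丿"]` is `'w'`, matching the sequence quoted above.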

Popular form-based encoding methods include Wubi on the mainland and Cangjie in Taiwan and Hong Kong.

The major advantage of form-based methods lies in their low degree of duplicate encoding, which enables high-speed input of Chinese characters.

In addition, users have to learn the complicated rules for breaking a character into a sequence of components and for selecting among those components.

[13] Compared with English, Chinese OCR and handwriting recognition are more difficult, because there are thousands of different commonly used characters instead of 26 letters.

Speech input faces two problems: the variation in pronunciation of words by different speakers, and the existence of homophones such as 'pair', 'pear' and 'pare' in English, and 攻势, 公式, 公示 (gong1shi4) in Chinese.

[14] The most important feature of intelligent input is the application of contextual constraints to the selection of candidate characters.

Though the toneless pinyin strings of 大学 (university) and 大雪 (heavy snow) are both "daxue", the computer can make a reasonable selection based on the subsequent words.
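The idea of choosing between "daxue" candidates by looking at the following word can be illustrated with a toy scorer. Everything in this sketch is hypothetical: the candidate list, the follower words and the scores are invented for illustration, whereas real systems derive such statistics from large corpora.

```python
# Hypothetical candidates for one toneless pinyin string.
CANDIDATES = {"daxue": ["大学", "大雪"]}

# Hypothetical co-occurrence scores for (candidate, following word) pairs.
FOLLOWER_SCORES = {
    ("大学", "教授"): 5,  # "university" + "professor"
    ("大雪", "纷飞"): 5,  # "heavy snow" + "swirling"
}


def choose(code, next_word):
    """Pick the candidate that scores highest with the following word.

    On a tie (for example, an unseen following word) the first candidate
    in the list is returned, acting as a default.
    """
    return max(
        CANDIDATES[code],
        key=lambda word: FOLLOWER_SCORES.get((word, next_word), 0),
    )
```

Here `choose("daxue", "教授")` selects 大学, while `choose("daxue", "纷飞")` selects 大雪.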

[15] In the Chinese writing system, there are graphemes other than complete Chinese characters, such as punctuation marks (e.g. '。', '、' and '《》'), strokes (e.g. '丿', '𠃍' and '乚'), radicals (e.g. '氵', '宀' and '刂'), and letters used for romanization, like the vowel letters with diacritics used in pinyin and the Yale romanization of Cantonese.

The following sections will introduce the most important encoding standards used in Chinese information technology, including GB, Big5 and Unicode.

GB 2312 includes 6,763 Chinese characters: 3,755 frequently used ones sorted by pinyin, and the remaining 3,008 sorted by radicals (indexing components).
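Assuming the character set described here is GB 2312 (the counts match that standard), the effect of the pinyin ordering can be observed with Python's built-in `gb2312` codec: sorting frequently used characters by their encoded bytes yields pinyin order. The helper name `gb2312_bytes` and the sample characters are chosen only for illustration.

```python
def gb2312_bytes(char):
    """Return the 2-byte GB 2312 code of a Chinese character."""
    return char.encode("gb2312")


# Three frequently used characters: 中 (zhong), 啊 (a), 大 (da).
chars = ["中", "啊", "大"]

# Sorting by encoded bytes reproduces pinyin order for this part of the set.
by_code = sorted(chars, key=gb2312_bytes)

# The characters not sorted by pinyin: 6,763 - 3,755.
rest = 6763 - 3755
```

After running this, `by_code` is `['啊', '大', '中']`, that is, a, da, zhong.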

[18] Big5 encoding was designed by five big IT companies in Taiwan in the early 1980s, and has been the de facto standard for representing traditional Chinese in computers ever since.

The Basic Multilingual Plane (BMP) is the 2-byte core of Unicode, with 2^16 = 65,536 code points for important characters of many languages.
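As a small illustration of the 2^16 limit, the sketch below checks whether a character's code point falls inside the BMP and how many bytes it takes in UTF-16. The function names `in_bmp` and `utf16_length` are hypothetical helpers, not part of any standard API.

```python
BMP_SIZE = 2 ** 16  # 65,536 code points, U+0000 to U+FFFF


def in_bmp(char):
    """True if the character's code point lies in the Basic Multilingual Plane."""
    return ord(char) < BMP_SIZE


def utf16_length(char):
    """Number of bytes the character occupies in UTF-16 (no byte-order mark)."""
    return len(char.encode("utf-16-le"))
```

A common character such as 中 lies in the BMP and takes 2 bytes in UTF-16, whereas the stroke character 𠃍 lies outside it and needs a 4-byte surrogate pair.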

HKSCS was developed by the Hong Kong government as a collection of locally specific Chinese characters not available on the computer in the early days, for instance 咗 (already), 嘢 (thing), 脷 (tongue), and 曱甴 (cockroach).

This problem is often solved by manually selecting the encoding or character set (as in Web browsers) or by converting the code beforehand.
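Assuming the problem in question is text being decoded with an encoding other than the one it was written in, the following sketch contrasts a wrong decoding with a conversion done beforehand. It uses only Python's built-in `big5`, `gb2312` and `utf-8` codecs; the function name `convert` is a hypothetical helper.

```python
def convert(data, source_encoding, target_encoding="utf-8"):
    """Re-encode bytes from a known source encoding into the target encoding."""
    return data.decode(source_encoding).encode(target_encoding)


# Traditional Chinese text stored as Big5 bytes.
big5_data = "中文".encode("big5")

# Decoding with the wrong character set does not recover the original text;
# errors="replace" keeps the call from raising on undecodable bytes.
wrong = big5_data.decode("gb2312", errors="replace")

# Selecting the correct encoding recovers the text.
right = big5_data.decode("big5")

# Converting beforehand lets any UTF-8 reader handle the text without guessing.
utf8_data = convert(big5_data, "big5")
```

In this example `right` is the original text, `wrong` is not, and `utf8_data` can be read by any program that expects UTF-8.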

The most popular Chinese fonts are the Song (宋体), Kai (楷体), Hei (黑体) and Fangsong (仿宋体) families.