Han unification

Han characters are a feature shared by written Chinese (hanzi), Japanese (kanji), Korean (hanja) and Vietnamese (chữ Hán).

In the formulation of Unicode, an attempt was made to unify the regional variants of these characters by considering them allographs – different glyphs representing the same "grapheme", or orthographic unit – hence "Han unification", with the resulting character repertoire sometimes contracted to Unihan.[1][a]

Nevertheless, many characters have regional variants assigned to different code points, such as Traditional 個 (U+500B) versus Simplified 个 (U+4E2A).
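
Where variants were left unmerged like this, the distinction is visible directly in the code points. A minimal Python check, using the character pairs cited in this article:

```python
# Unmerged regional variants each occupy their own code point,
# which ord() exposes directly.
for traditional, simplified in [("個", "个"), ("紅", "红")]:
    print(f"U+{ord(traditional):04X} {traditional}  vs  "
          f"U+{ord(simplified):04X} {simplified}")
# U+500B 個  vs  U+4E2A 个
# U+7D05 紅  vs  U+7EA2 红
```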

However, Han unification has also caused considerable controversy, particularly among the Japanese public and the nation's literati, who have a history of protesting the culling of historically and culturally significant variants.

Today, the list of characters officially recognized in Japan for use in proper names continues to expand at a modest pace.

In 1993, the Japan Electronic Industries Development Association (JEIDA) published a pamphlet titled "未来の文字コード体系に私達は不安をもっています" ("We are worried about future character encoding systems", JPNO 20985671), summarizing the major criticisms of the Han unification approach adopted by Unicode.

In contrast, consider ASCII's unification of punctuation and diacritics, where graphemes with widely different meanings (for example, an apostrophe and a single quotation mark) are unified because the glyphs are the same.
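
The ASCII parallel is visible in the code charts themselves: ASCII provides a single code point, U+0027, for both uses, while Unicode later added distinct quotation-mark characters. A short Python illustration using the standard unicodedata module:

```python
import unicodedata

# U+0027 serves as both apostrophe and single quote in ASCII text;
# U+2018/U+2019 are the distinct Unicode quotation marks.
for ch in ["'", "\u2018", "\u2019"]:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0027  APOSTROPHE
# U+2018  LEFT SINGLE QUOTATION MARK
# U+2019  RIGHT SINGLE QUOTATION MARK
```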

Since the Unihan standard encodes "abstract characters", not "glyphs", the graphical artifacts produced by Unicode have been considered temporary technical hurdles, and at most, cosmetic.

However, again, particularly in Japan, due in part to the way in which Chinese characters were incorporated into Japanese writing systems historically, the inability to specify a particular variant was considered a significant obstacle to the use of Unicode in scholarly work.

For example, the unification of "grass" (explained above) means that a historical text cannot be encoded so as to preserve its peculiar orthography.

Instead, for example, the scholar would be required to locate the desired glyph in a specific typeface in order to convey the text as written, defeating the purpose of a unified character set.

Besides making some Unicode fonts unusable for texts involving multiple "Unihan languages", names or other orthographically sensitive terminology might be displayed incorrectly.

While this may be considered primarily a graphical representation or rendering problem to be overcome by more artful fonts, the widespread use of Unicode would make it difficult to preserve such distinctions.

The initial design goal was to create a 16-bit standard,[14] offering only 65,536 code points, and Han unification was therefore a critical step for avoiding tens of thousands of character duplications.

The controversy later extended to the internationally representative ISO: the initial CJK Joint Research Group (CJK-JRG) favored a proposal (DIS 10646) for a non-unified character set, "which was thrown out in favor of unification with the Unicode Consortium's unified character set by the votes of American and European ISO members" (even though the Japanese position was unclear).

Unfortunately, language-specific fonts also make it difficult to access a variant which, as with the "grass" example, happens to appear more typically in another language style.

Traditional Chinese characters are used in Hong Kong and Taiwan (Big5), and they are, with some differences, more familiar to Korean and Japanese users.

There are several alternative character sets that do not encode according to the principle of Han unification and are thus free from its restrictions.

Region-dependent character sets are likewise seen as unaffected by Han unification because of their region-specific nature.

However, none of these alternative standards has been as widely adopted as Unicode, which is now the base character set for many new standards and protocols, is internationally adopted, and is built into the architecture of operating systems (Microsoft Windows, Apple macOS, and many Unix-like systems), programming languages (Perl, Python, C#, Java, Common Lisp, APL, C, C++), libraries (IBM International Components for Unicode (ICU) along with the Pango, Graphite, Scribe, Uniscribe, and ATSUI rendering engines), font formats (TrueType and OpenType), and so on.

Paradoxically, Unicode considers 兩 and 両 to be near-identical z-variants while at the same time classifying them as significantly different semantic variants.

According to Unicode's definitions, it follows that all simplifications (other than those that merge wholly different characters on account of their homophony) are a form of semantic variant.
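
These variant relationships are recorded as machine-readable properties (kZVariant, kSemanticVariant, and related fields) in the Unihan database that ships with the Unicode Character Database. A minimal Python sketch, assuming a local copy of Unihan_Variants.txt; the exact records returned depend on the Unihan version in use:

```python
from collections import defaultdict

def load_variants(path="Unihan_Variants.txt"):
    """Parse the tab-separated Unihan variant file:
    each data line is <code point> <field> <value>."""
    variants = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            codepoint, field, value = line.rstrip("\n").split("\t")
            variants[codepoint][field] = value
    return variants

variants = load_variants()
# U+5169 is 兩; its kZVariant and kSemanticVariant fields record
# the relationships discussed above.
for field, value in sorted(variants["U+5169"].items()):
    print(field, "->", value)
```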

This conflicts with Unicode's stated goal of removing that overhead and allowing any number of the world's scripts to coexist in the same document under one encoding system.

"[11] This leaves the option to settle on one unified reference grapheme for all z-variants, which is contentious since few outside of Japan would recognize 佛 and 仏 as equivalent.

In comparative renderings of the same code points, each column can be marked (by the HTML lang attribute) as being in a different language: Chinese (simplified and two types of traditional), Japanese, Korean, or Vietnamese.
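
One way to experiment with this mechanism is to generate a test page that repeats a single code point under different BCP 47 language tags and view it in a browser with suitable fonts installed. A rough Python sketch; the file name and the exact set of tags are illustrative:

```python
# Render the same code point under several lang tags so the
# browser can choose a region-appropriate glyph for each.
CODEPOINT = "\u8fd4"  # 返, the example shown in the figure below
LANGS = ["zh-Hans", "zh-Hant", "zh-Hant-HK", "ja", "ko", "vi"]

rows = "\n".join(
    f'  <span lang="{lang}">{CODEPOINT}</span> ({lang})<br>'
    for lang in LANGS
)
html = f"<!DOCTYPE html>\n<html>\n<body>\n{rows}\n</body>\n</html>"

with open("unihan_lang_test.html", "w", encoding="utf-8") as f:
    f.write(html)
```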

Almost all of the variants that the PRC developed or standardized received distinct code points, owing simply to the fortune of the Simplified Chinese transition carrying through into the computing age.

Sixty-two Shinjitai "simplified" characters with distinct code points in Japan were merged with their Kyūjitai traditional equivalents, like 海.

Both 紅 (U+7D05) and 红 (U+7EA2) were given separate code points by the PRC's text-encoding standards bodies so that Chinese-language documents could use both versions.

The fact that almost every other change brought about by the PRC, no matter how minor, did warrant its own code point suggests that this exception may have been unintentional.

Similar problems face users of one CJK language reading a document rendered with "foreign" glyphs: variants of 骨 can appear as mirror images, 者 can be missing a stroke or have an extraneous stroke, and 令 may be unreadable to non-Japanese readers.

In some cases, often where the changes are the most striking, Unicode has encoded variant characters, making it unnecessary to switch between fonts or lang attributes.

As an example, take a character such as 入 (U+5165), for which the only way to display the variants is to change font (or lang attribute), as described above.

Differences for the same Unicode code point (U+8FD4) in regional versions of Source Han Sans
The Latin lowercase "a" has widely differing glyphs that all represent concrete instances of the same abstract grapheme. Although a native reader of any language using the Latin script recognizes glyphs such as the double-storey and single-storey forms as the same grapheme, to others they might appear to be completely unrelated.