The history of character codes illustrates the evolving need for machine-mediated transmission of character-based symbolic information over a distance, using once-novel electrical means.
The earliest codes were based upon manual and hand-written encoding and ciphering systems, such as Bacon's cipher, Braille, international maritime signal flags, and the 4-digit encoding of Chinese characters for a Chinese telegraph code (Hans Schjellerup, 1869).
With the adoption of electrical and electro-mechanical techniques, these earliest codes were adapted to the new capabilities and limitations of the early machines.
Since the punched card code then in use only allowed digits, upper-case English letters, and a few special characters, six bits were sufficient.
IBM's BCD encodings were the precursors of its Extended Binary-Coded Decimal Interchange Code (usually abbreviated as EBCDIC), an eight-bit encoding scheme developed in 1963 for the IBM System/360 that featured a larger character set, including lower-case letters.
While Fieldata addressed many of the then-modern issues (e.g. letter and digit codes arranged for machine collation), it fell short of its goals and was short-lived.
In trying to develop universally interchangeable character encodings, researchers in the 1980s faced the dilemma that, on the one hand, it seemed necessary to add more bits to accommodate additional characters, but on the other hand, for the users of the relatively small character set of the Latin alphabet (who still constituted the majority of computer users), those additional bits were a colossal waste of then-scarce and expensive computing resources (as they would always be zeroed out for such users).
In 1985, the average personal computer user's hard disk drive could store only about 10 megabytes, and it cost approximately US$250 on the wholesale market (and much more if purchased separately at retail),[8] so it was very important at the time to make every bit count.
The compromise solution that was eventually found and developed into Unicode was to break the assumption (dating back to telegraph codes) that each character should always directly correspond to a particular sequence of bits.
Instead, characters would first be mapped to a universal intermediate representation in the form of abstract numbers called code points.
The term "code page" is not used in Unix or Linux, where "charmap" is preferred, usually in the larger context of locales.
Some writing systems, such as Arabic and Hebrew, need to accommodate things like graphemes that are joined in different ways in different contexts, but represent the same semantic character.
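As a rough illustration of this point (not from the original text), the Arabic Presentation Forms blocks encode positional variants of letters whose compatibility decomposition under Unicode normalization is the single underlying letter. A short Python 3 sketch:

    import unicodedata

    # Two positional (presentation) forms of the Arabic letter alef: they are
    # drawn differently depending on context, but represent the same semantic
    # character, U+0627 ARABIC LETTER ALEF.
    isolated = "\uFE8D"   # ARABIC LETTER ALEF ISOLATED FORM
    final = "\uFE8E"      # ARABIC LETTER ALEF FINAL FORM

    # Compatibility normalization (NFKC) maps both back to the base letter.
    print(unicodedata.normalize("NFKC", isolated) == "\u0627")  # True
    print(unicodedata.normalize("NFKC", final) == "\u0627")     # True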
The purpose of this layered decomposition, from characters to code points and from code points to code units and bytes, is to establish a universal set of characters that can be encoded in a variety of ways.
For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on.
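A minimal Python 3 sketch (illustrative, not part of the original article) of this mapping between characters and code points:

    # ord() and chr() expose the character-to-code-point mapping directly.
    print(ord("A"))              # 65, i.e. U+0041
    print(ord("B"))              # 66, i.e. U+0042
    print(chr(0x10400))          # the supplementary character U+10400
    print(f"U+{ord('A'):04X}")   # conventional U+ hexadecimal notation: U+0041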
A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system).
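For instance, under the UTF-16 encoding form a code point in the Basic Multilingual Plane occupies one 16-bit code unit, while a supplementary code point occupies two (a surrogate pair). A hedged Python 3 sketch, where utf16_code_units is a helper defined here purely for illustration:

    import struct

    def utf16_code_units(ch):
        # Encode as big-endian UTF-16 (no byte order mark) and split the
        # resulting bytes into 16-bit code units.
        data = ch.encode("utf-16-be")
        return [unit for (unit,) in struct.iter_unpack(">H", data)]

    print([hex(u) for u in utf16_code_units("A")])            # ['0x41']
    print([hex(u) for u in utf16_code_units("\U00010400")])   # ['0xd801', '0xdc00']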
The range of valid code points (the codespace) for the Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 planes, identified by the numbers 0 to 16.
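Since each plane holds 0x10000 code points, the plane of a code point is simply its value divided by 0x10000, as in this small illustrative Python 3 sketch:

    def plane(code_point):
        # Valid Unicode code points range from U+0000 to U+10FFFF.
        assert 0x0000 <= code_point <= 0x10FFFF, "outside the Unicode codespace"
        return code_point >> 16   # integer division by 0x10000

    print(plane(0x0041))    # 0, the Basic Multilingual Plane
    print(plane(0x10400))   # 1, the Supplementary Multilingual Plane
    print(plane(0x10FFFF))  # 16, the last plane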
Consider a string of the letters "ab̲c𐐀", that is, a string containing a Unicode combining character (U+0332 ̲ COMBINING LOW LINE) as well as a supplementary character (U+10400 𐐀 DESERET CAPITAL LETTER LONG I).
Although the UTF-8, UTF-16, and UTF-32 forms each use the same total number of bits (32) to represent this supplementary character, it is not obvious how the actual numeric byte values are related.
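A Python 3 sketch (illustrative only; it assumes big-endian serializations without a byte order mark, and the hexadecimal separator argument requires Python 3.8 or later) makes the differing byte sequences for this string visible:

    # 'a', 'b' followed by U+0332 COMBINING LOW LINE, 'c', and U+10400.
    s = "ab\u0332c\U00010400"

    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        print(codec, s.encode(codec).hex(" "))
    # utf-8     61 62 cc b2 63 f0 90 90 80
    # utf-16-be 00 61 00 62 03 32 00 63 d8 01 dc 00
    # utf-32-be 00 00 00 61 00 00 00 62 00 00 03 32 00 00 00 63 00 01 04 00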