GSM 03.38

Note that the second part of the table is only accessible if the GSM device supports the 7-bit extension mechanism, using the ESC character prefix.

Most of the high part of the table is not used in the default character set, but the GSM standard defines some language code indicators that allows the system to identify national variants of this part, to support more characters than those displayed in the above table.

In a standard GSM text message, all characters are encoded using 7-bit code units, packed together to fill all bits of octets.

[5] This works, because for characters in the Basic Multilingual Plane (including full alphabets of most modern human languages) UCS-2 and UTF-16 encodings are identical.

To encode characters outside of the BMP (unreachable in plain UCS-2), such as Emoji, UTF-16 uses surrogate pairs, which when decoded with UCS-2 would appear as two valid but unmapped code points.

The default is to use the 7-bit encoding described above, until one enters a character that is not present in the GSM 7-bit table (for example the lowercase 'a' with acute: 'á').

Others vary based on the choice and configuration of SMS application, and the length of the message[citation needed].

Since release 8 of the 3GPP 23.038 standard of March 2008, additional characters sets can be accessed through the use of a National Language Shift Tables.

Release 9 introduced 10 languages used in India written with a Brahmic scripts (Bengali, Gujarati, Hindi, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu) and Urdu.

There is still no defined national language shift table for French, Greek, Russian, Bulgarian, Arabic, Hebrew and most Central European languages that need a better coverage than the default 7-bit standard character set and its default 7-bit extension character set: if ever any character is composed that cannot be represented in those default GSM 7-bit sets, the message will be automatically reencoded using UCS-2, with the effect of dividing by more than two the maximum length in characters of messages that can be sent at the price of a single SMS (when a message is split in multiple parts, a few other octets are needed in the User Data Header to indicate the sequence number of each part).

Although a revision of GSM 03.38 (as early as in version 4.0.1 of September 1994) has defined Data Coding Scheme values for Cell Broadcast System (CBS) for German, English, Italian, French, Spanish, Dutch, Swedish, Danish, Finnish, Norwegian, Greek and Turkish; with Hungarian, Polish, Czech, Hebrew, Arabic, Russian and Icelandic added in later revisions, no coding tables were defined for these languages.