Alphabet (formal languages)

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/characters/glyphs,[1] typically thought of as representing letters, characters, digits, phonemes, or even words.

[2][3] Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics.

[4] For example, the alphabet of lowercase letters "a" through "z" can be used to form English words like "iceberg" while the alphabet of both upper and lower case letters can also be used to form proper names like "Wikipedia".

Infinite sequences of symbols may be considered as well (see Omega language).

It is often necessary for practical purposes to restrict the symbols in an alphabet so that they are unambiguous when interpreted.

For instance, if the two-member alphabet is {00,0}, a string written on paper as "000" is ambiguous because it is unclear if it is a sequence of three "0" symbols, a "00" followed by a "0", or a "0" followed by a "00".

of all finite strings (regardless of their length) is indicated by the Kleene star operator as

For example, using the binary alphabet {0,1}, the strings ε, 0, 1, 00, 01, 10, 11, 000, etc.

are all in the Kleene closure of the alphabet (where ε represents the empty string).

Alphabets are important in the use of formal languages, automata and semiautomata.

In most cases, for defining instances of automata, such as deterministic finite automata (DFAs), it is required to specify an alphabet from which the input strings for the automaton are built.

In these applications, an alphabet is usually required to be a finite set, but is not otherwise restricted.

When using automata, regular expressions, or formal grammars as part of string-processing algorithms, the alphabet may be assumed to be the character set of the text to be processed by these algorithms, or a subset of allowable characters from the character set.