In the case of a natural language, those categories include nouns, verbs, adjectives, punctuation, etc.
In the case of a programming language, the categories include identifiers, operators, grouping symbols, and data types.
However, lexers can sometimes include complexity, such as phrase-structure processing to simplify the input for the parser, and may be written partly or fully by hand, either to support more features or for performance.
Lexical tokenization is the conversion of a raw text into (semantically or syntactically) meaningful lexical tokens, belonging to categories defined by a "lexer" program, such as identifiers, operators, grouping symbols, and data types.
The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or regular expression /\s{1}/).
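The split described above can be sketched in a few lines of Python; the input string here is an illustrative pangram chosen because it happens to have 43 characters and 9 space-separated words, not necessarily the original example:

```python
import re

text = "The quick brown fox jumps over the lazy dog"
assert len(text) == 43

# Split on the literal space string:
tokens = text.split(" ")
# Equivalent split using the regular expression /\s{1}/:
tokens_re = re.split(r"\s{1}", text)

assert tokens == tokens_re
assert len(tokens) == 9
```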
The parser typically retrieves this information from the lexer and stores it in the abstract syntax tree.
For example, a typical lexical analyzer recognizes parentheses as tokens but does nothing to ensure that each "(" is matched with a ")".
Tokens are often defined by regular expressions, which are understood by a lexical analyzer generator such as lex, or handcoded equivalent finite-state automata.
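A hand-coded equivalent of such an automaton can be quite short. The following sketch assumes just three illustrative token classes (identifiers, integer literals, and single-character operators); each branch corresponds to a state of the automaton:

```python
# A minimal hand-coded finite-state tokenizer (illustrative token
# classes: IDENT, NUMBER, OP).
def tokenize(s):
    tokens, i = [], 0
    while i < len(s):
        c = s[i]
        if c.isspace():
            i += 1                      # skip whitespace between tokens
        elif c.isalpha() or c == "_":   # state: inside an identifier
            j = i
            while j < len(s) and (s[j].isalnum() or s[j] == "_"):
                j += 1
            tokens.append(("IDENT", s[i:j]))
            i = j
        elif c.isdigit():               # state: inside a number
            j = i
            while j < len(s) and s[j].isdigit():
                j += 1
            tokens.append(("NUMBER", s[i:j]))
            i = j
        else:                           # any other character: an operator
            tokens.append(("OP", c))
            i += 1
    return tokens
```

For example, `tokenize("x = 42 + y")` yields `[("IDENT", "x"), ("OP", "="), ("NUMBER", "42"), ("OP", "+"), ("IDENT", "y")]`.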
In some languages, the lexeme creation rules are more complex and may involve backtracking over previously read characters.
In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value.
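The evaluator stage can be sketched as a function from a token class and its lexeme to a value; the classes and conversion rules below are illustrative, not tied to any particular language:

```python
# Evaluator sketch: given a token class and the lexeme's characters,
# produce the token's value.
def evaluate(kind, lexeme):
    if kind == "NUMBER":
        return int(lexeme)    # e.g. the lexeme "42" becomes the integer 42
    if kind == "STRING":
        return lexeme[1:-1]   # strip the surrounding quote characters
    if kind == "IDENT":
        return lexeme         # identifiers keep their spelling as the value
    return None               # punctuation and the like carry no value
```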
For example, in the source code of a computer program, a string might be converted into a lexical token stream in which whitespace is suppressed and special characters have no value. Lexers may be written by hand.
These tools generally accept regular expressions that describe the tokens allowed in the input stream.
However, even here there are many edge cases such as contractions, hyphenated words, emoticons, and larger constructs such as URIs (which for some purposes may count as single tokens).
Tokenization is particularly difficult for languages written in scriptio continua, which exhibit no word boundaries, such as Ancient Greek, Chinese,[4] or Thai.
Some ways to address the more difficult problems include developing more complex heuristics, querying a table of common special cases, or fitting the tokens to a language model that identifies collocations in a later processing step.
These generators are a form of domain-specific language, taking in a lexical specification – generally regular expressions with some markup – and emitting a lexer.
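A toy version of such a generator can be built with Python's `re` module: the "specification" is a list of (token-name, regex) pairs, and the generator compiles it into a lexer function. This is only a sketch of the idea, not how lex itself works internally:

```python
import re

# Toy lexer generator: take a lexical specification and emit a lexer.
def make_lexer(spec):
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))
    def lex(text):
        for m in pattern.finditer(text):
            if m.lastgroup != "SKIP":        # suppress whitespace tokens
                yield (m.lastgroup, m.group())
    return lex

lexer = make_lexer([
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
])
```

Calling `list(lexer("x = 1 + 2"))` then yields `[("IDENT", "x"), ("OP", "="), ("NUMBER", "1"), ("OP", "+"), ("NUMBER", "2")]`.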
With the latter approach the generator produces an engine that directly jumps to follow-up states via goto statements.
It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools.
In these cases, semicolons are part of the formal phrase grammar of the language, but may not be found in input text, as they can be inserted by the lexer.
Generally lexical grammars are context-free, or almost so, and thus require no looking back or ahead, or backtracking, which allows a simple, clean, and efficient implementation.
Simple examples include semicolon insertion in Go, which requires looking back one token; concatenation of consecutive string literals in Python,[7] which requires holding one token in a buffer before emitting it (to see if the next token is another string literal); and the off-side rule in Python, which requires maintaining a count of indent level (indeed, a stack of each indent level).
These examples all only require lexical context, and while they complicate a lexer somewhat, they are invisible to the parser and later phases.
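The off-side rule's indent stack can be sketched as follows; the token names and the line-based interface are illustrative, loosely modeled on Python's INDENT/DEDENT tokens:

```python
# Off-side-rule sketch: a stack of indentation levels turns leading
# whitespace into INDENT/DEDENT tokens for the parser.
def offside_tokens(lines):
    stack = [0]                               # stack of open indent levels
    for line in lines:
        if not line.strip():
            continue                          # blank lines emit no tokens
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:                 # deeper: open a block
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:              # shallower: close block(s)
            stack.pop()
            yield ("DEDENT", width)
        yield ("LINE", line.strip())
    while len(stack) > 1:                     # close open blocks at EOF
        stack.pop()
        yield ("DEDENT", 0)
```

For the input lines `["if x:", "    y = 1", "z = 2"]` this emits a `LINE`, an `INDENT`, a `LINE`, a `DEDENT`, and a final `LINE`, mirroring the block structure.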
A more complex example is the lexer hack in C, where the token class of a sequence of characters cannot be determined until the semantic analysis phase since typedef names and variable names are lexically identical but constitute different token classes.
Thus, in the hack, the lexer calls the semantic analyzer (say, the symbol table) and checks whether the sequence is a typedef name.
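The essence of the hack can be sketched as a classification step that consults the symbol table built during semantic analysis; the table contents and token names below are illustrative, not any real compiler's API:

```python
# Lexer-hack sketch: the lexer asks the semantic analyzer's symbol
# table whether a name has been declared as a typedef.
typedef_names = {"size_t", "FILE"}   # filled in by semantic analysis

def classify(identifier):
    if identifier in typedef_names:
        return ("TYPE_NAME", identifier)   # lexically identical, but a type
    return ("IDENT", identifier)           # ordinary variable/function name
```

With this, `classify("size_t")` returns a `TYPE_NAME` token while `classify("x")` returns a plain `IDENT`, even though both lexemes match the same identifier pattern.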