Parsing

In this context, parsing refers to the way that human beings analyze a sentence or phrase (in spoken language or text) "in terms of grammatical constituents, identifying the parts of speech, syntactic relations, etc."

Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language.[citation needed]

In order to parse natural language data, researchers must first agree on the grammar to be used.

Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand).

This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts.

Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars),[6] maximum entropy,[7] and neural nets.[8]

Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech).
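To make the PCFG idea concrete: in a probabilistic context-free grammar each rule carries a probability, and the probability of a derivation is the product of the probabilities of the rules it uses. The grammar and probabilities below are invented for this sketch:

```python
# Toy PCFG (rules and probabilities invented for the example).
# A derivation's probability is the product of its rule probabilities.
pcfg = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("Name",)):    0.4,
    ("VP", ("V", "NP")):  1.0,
}

def derivation_prob(rules):
    """Multiply the probabilities of the rules used in one derivation."""
    p = 1.0
    for rule in rules:
        p *= pcfg[rule]
    return p

# One derivation of "Name V Det N":
# S -> NP VP, NP -> Name, VP -> V NP, NP -> Det N
p = derivation_prob([
    ("S",  ("NP", "VP")),
    ("NP", ("Name",)),
    ("VP", ("V", "NP")),
    ("NP", ("Det", "N")),
])
print(round(p, 2))  # 0.24
```

A statistical parser estimates these rule probabilities from the hand-annotated training corpus and then searches for the highest-probability derivation of the input.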

A somewhat recent development has been parse reranking, in which the parser proposes a large number of analyses and a more complex system selects the best option.[citation needed]

In natural language understanding applications, semantic parsers convert the text into a representation of its meaning.

Another type of sentence that is difficult to parse is one with an attachment ambiguity, in which a phrase could potentially modify different parts of the sentence, making the syntactic relationships hard to identify (e.g. "The boy saw the lady with the telescope", in which the phrase with the telescope could modify either saw or the lady).
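The attachment ambiguity can be made concrete with a chart parser: a CYK-style chart that counts, rather than builds, the distinct parse trees for each span. The grammar and lexicon below are invented for the example; the two parses correspond to attaching the prepositional phrase to the verb phrase or to the object noun phrase:

```python
from collections import defaultdict

# Hypothetical CNF grammar illustrating PP-attachment ambiguity.
rules = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("NP", ("NP", "PP")),   # "the lady with the telescope"
    ("VP", ("V", "NP")),
    ("VP", ("VP", "PP")),   # "saw the lady ... with the telescope"
    ("PP", ("P", "NP")),
]
lexicon = {"the": "Det", "boy": "N", "lady": "N",
           "telescope": "N", "saw": "V", "with": "P"}

def count_parses(words):
    """CYK chart mapping (start, end, symbol) to a count of parse trees."""
    n = len(words)
    chart = defaultdict(int)
    for i, w in enumerate(words):
        chart[(i, i + 1, lexicon[w])] = 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):          # try every split point
                for lhs, (a, b) in rules:
                    chart[(i, k, lhs)] += chart[(i, j, a)] * chart[(j, k, b)]
    return chart[(0, n, "S")]

print(count_parses("the boy saw the lady with the telescope".split()))  # 2
```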

Sentences with two or, in the most extreme cases, three center embeddings are challenging for mental parsing, again because of ambiguity of syntactic relationship.[14]

The opposing, more contemporary model theorizes that within the mind, the processing of a sentence is not modular, or happening in strict sequence.

The parser is often preceded by a separate lexical analyser, which creates tokens from the sequence of input characters; alternatively, these can be combined in scannerless parsing.

In the case of programming languages, a parser is a component of a compiler or interpreter, which parses the source code of a computer programming language to create some form of internal representation; the parser is a key step in the compiler frontend.

The implied disadvantages of a one-pass compiler can largely be overcome by adding fix-ups: provision is made for code relocation during the forward pass, and the fix-ups are applied backwards once the current program segment is recognized as complete.
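The fix-up idea can be sketched as backpatching in a one-pass code generator: emit a placeholder for a forward jump, remember its position, and patch it once the end of the segment (and hence the jump target) is known. The instruction names below are invented for the illustration:

```python
# Sketch of a fix-up (backpatch). During a single forward pass the target
# of a forward jump is not yet known, so a placeholder is emitted and
# patched later. Instruction names are invented for this example.
code = []

code.append(("JUMP_IF_FALSE", None))   # target unknown at this point
fixup_at = len(code) - 1               # remember where to patch
code.append(("PRINT", "then-branch"))
code.append(("PRINT", "still then-branch"))

# End of the conditional segment: the target is now known; apply the fix-up.
code[fixup_at] = ("JUMP_IF_FALSE", len(code))

print(code)
```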

The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions.

For example, a calculator program would look at an input such as "12 * (3 + 4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression.
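As a sketch of this stage, the calculator's token classes can be captured with a pair of regular-expression groups, one for numbers and one for operators and parentheses (the function and regex here are illustrative, not any particular tool's API):

```python
import re

# Minimal lexer sketch: token classes defined by regular expressions,
# matched repeatedly against the input stream (skipping whitespace).
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([*+\-/^()]))")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}: {text[pos]!r}")
        number, op = m.groups()
        tokens.append(int(number) if number else op)
        pos = m.end()
    return tokens

print(tokenize("12 * (3 + 4)^2"))  # [12, '*', '(', 3, '+', 4, ')', '^', 2]
```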

This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear.

However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers.

The final phase is semantic parsing or analysis, which works out the implications of the expression just validated and takes the appropriate action.[17]

In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code.
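A minimal sketch of how an interpreter can fuse these phases: a recursive-descent parser for a small arithmetic grammar (invented for the example) whose semantic action at each rule is immediate evaluation. It consumes a token list like the one produced above:

```python
# Recursive-descent evaluator for the toy grammar
#   expr   -> term   (('+'|'-') term)*
#   term   -> factor (('*'|'/') factor)*
#   factor -> atom   ('^' factor)?        # '^' is right-associative
#   atom   -> NUMBER | '(' expr ')'
# Parsing and semantic action (evaluation) are fused, as in an interpreter.

def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expr():
        value = term()
        while peek() in ("+", "-"):
            value = value + term() if eat() == "+" else value - term()
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            value = value * factor() if eat() == "*" else value / factor()
        return value

    def factor():
        value = atom()
        if peek() == "^":
            eat("^")
            value = value ** factor()   # recurse right: right-associative
        return value

    def atom():
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        return eat()                    # a number token

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"trailing input at token {pos}")
    return result

print(evaluate([12, "*", "(", 3, "+", 4, ")", "^", 2]))  # 588
```

Because the grammar places `term` below `expr` and `factor` below `term`, operator precedence falls out of the rule structure rather than requiring a separate mechanism.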

The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar.[24]

Adaptive parsing algorithms have been used to construct "self-extending" natural language user interfaces.

Initially, Input = [1, +, 2, *, 3]. The parse tree and the code resulting from it are not correct according to the language's semantics.
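The problem with that parse can be shown numerically. A sketch, with two hypothetical parsing strategies for the token list [1, +, 2, *, 3]: reducing strictly left to right yields (1 + 2) * 3, while respecting operator precedence yields 1 + (2 * 3):

```python
import functools
import operator

tokens = [1, "+", 2, "*", 3]
ops = {"+": operator.add, "*": operator.mul}

def left_to_right(toks):
    """Naive parse: reduce strictly left to right, ignoring precedence."""
    value = toks[0]
    for op, operand in zip(toks[1::2], toks[2::2]):
        value = ops[op](value, operand)
    return value

def with_precedence(toks):
    """Two-level parse: reduce '*' runs first, then sum the products."""
    terms, current = [], [toks[0]]
    for op, operand in zip(toks[1::2], toks[2::2]):
        if op == "*":
            current.append(operand)
        else:
            terms.append(current)
            current = [operand]
    terms.append(current)
    return sum(functools.reduce(operator.mul, t) for t in terms)

print(left_to_right(tokens), with_precedence(tokens))  # 9 7
```

The naive tree produces 9, but the language's semantics require 7, which is why the parser must encode precedence in its grammar or reduction strategy.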

Flow of data in a typical parser
C program that cannot be parsed with fewer than 2 tokens of lookahead. Top: C grammar excerpt.[28] Bottom: a parser has digested the tokens "int v ; main(){" and is about to choose a rule to derive Stmt. Looking only at the first lookahead token "v", it cannot decide which of the two alternatives for Stmt to choose; that decision requires peeking at the second token.
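The cited grammar excerpt is not reproduced here, but the lookahead problem can be sketched with invented Stmt alternatives that share their first token, so that only a second token of lookahead can select a rule:

```python
# Hypothetical grammar fragment (NOT the cited C excerpt): two Stmt
# alternatives that both begin with an identifier, so one token of
# lookahead cannot select a rule.
#   Stmt -> id ';'            (expression statement)
#   Stmt -> id '=' Expr ';'   (assignment)

def choose_stmt_rule(tokens):
    """Select a Stmt alternative using two tokens of lookahead."""
    first, second = tokens[0], tokens[1]
    if not first.isidentifier():
        raise SyntaxError(f"no Stmt alternative starts with {first!r}")
    # The first token is an identifier in both cases; the SECOND decides.
    return "Stmt -> id ';'" if second == ";" else "Stmt -> id '=' Expr ';'"

print(choose_stmt_rule(["v", ";"]))            # Stmt -> id ';'
print(choose_stmt_rule(["v", "=", "1", ";"]))  # Stmt -> id '=' Expr ';'
```

A grammar needing k tokens of lookahead in this way is LL(k) rather than LL(1), which is exactly the situation the figure illustrates for C.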