Parsing expression grammar

Unlike CFGs, PEGs cannot be ambiguous; a string has exactly one valid parse tree or none.

Examples of this include: In the concrete syntax, quoted and bracketed terminals have backslash escapes, so that "line feed or carriage return" may be written [\n\r].

The primary concrete syntax assigns no distinct meaning to terminals depending on whether they use single or double quotes, but some dialects treat one as case-sensitive and the other as case-insensitive.

This is a PEG that recognizes mathematical formulas that apply the basic five operations to non-negative integers.

Note that rules Sum and Product don't lead to desired left-associativity of these operations (they don't deal with associativity at all, and it has to be handled in post-processing step after parsing), and the Power rule (by referring to itself on the right) results in desired right-associativity of exponent.

Ordered choice is analogous to soft cut operators available in some logic programming languages.

By carefully choosing the order in which the grammar alternatives are specified, a programmer has a great deal of control over which parse tree is selected.

Because they can use an arbitrarily complex sub-expression to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility, in particular when reordering the alternatives cannot specify the exact parse tree desired.

An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input.

An atomic parsing expression consisting of a nonterminal A represents a recursive call to the nonterminal-function A.

Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result.

Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking.

The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator.

[5] Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.

Parsing in reverse order solves the left recursion problem, allowing left-recursive rules to be used directly in the grammar without being rewritten into non-left-recursive form, and also confers optimal error recovery capabilities upon the parser, which historically proved difficult to achieve for recursive descent parsers.

Many parsing algorithms require a preprocessing step where the grammar is first compiled into an opaque executable form, often some sort of automaton.

Parsing expressions can be executed directly (even if it is typically still advisable to transform the human-readable PEGs shown in this article into a more native format, such as S-expressions, before evaluating them).

Compared to pure regular expressions (i.e., describing a language recognisable using a finite automaton), PEGs are vastly more powerful.

In practice that is no problem — for example a dynamically sized hash table attains this – but that makes use of pointer arithmetic, so it presumes having a random-access machine.

Theoretical discussions of data structures and algorithms have an unspoken tendency to presume a more restricted model (possibly that of lambda calculus, possibly that of Scheme), where a sparse table rather has to be built using trees, and data item access is not constant time.

Viewed the other way around, this says packrat parsers tap into computational power readily available in real life systems, that older parsing algorithms do not understand to employ.

For a left-to-right top-down parser, such rules cause infinite regress: parsing will continually expand the same nonterminal without moving forward in the string.

Because the term appears in the leftmost position, these rules make up a circular definition that cannot be resolved.

Only the OMeta parsing algorithm[16] supports full direct and indirect left recursion without additional attendant complexity (but again, at a loss of the linear time complexity), whereas all GLR parsers support left recursion.

It is instructive to work out exactly what a PEG parser does when attempting to match against the string xxxxxq.

LL(k) and LR(k) parser generators will fail to complete when the input grammar is ambiguous.

A PEG parser generator will resolve unintended ambiguities earliest-match-first, which may be arbitrary and lead to surprising parses.

The ordering of productions in a PEG grammar affects not only the resolution of ambiguity, but also the language matched.

⁠): Ford notes that The second alternative in the latter PEG rule will never succeed because the first choice is always taken if the input string ... begins with 'a'..[1] Specifically, ⁠

⁠ can be constructed It is an open problem to give a concrete example of a context-free language which cannot be recognized by a parsing expression grammar.