Syntax (programming languages)

[3] Second, the parser turns the linear sequence of tokens into a hierarchical syntax tree; this is known as "parsing" narrowly speaking.

This ensures that the line of tokens conform to the formal grammars of the programming language.

The AST and contextual analysis steps can be considered a form of semantic analysis, as they are adding meaning and interpretation to the syntax, or alternatively as informal, manual implementations of syntactical rules that would be difficult or awkward to describe or implement formally.

Phrases are in a context-free language (CFL), generally a deterministic context-free language (DCFL), specified in a phrase structure grammar, which is a Type-2 grammar, generally given as production rules in Backus–Naur form (BNF).

In principle, contextual structure can be described by a context-sensitive grammar, and automatically analyzed by means such as attribute grammars, though, in general, this step is done manually, via name resolution rules and type checking, and implemented via a symbol table which stores names and types for each scope.

As an example, (add 1 1) is a syntactically valid Lisp program (assuming the 'add' function exists, else name resolution fails), adding 1 and 1.

Type errors of this kind can be detected at compile-time: They can be detected during parsing (phrase analysis) if the compiler uses separate rules that allow "integerLiteral + integerLiteral" but not "stringLiteral + integerLiteral", though it is more likely that the compiler will use a parsing rule that allows all expressions of the form "LiteralOrIdentifier + LiteralOrIdentifier" and then the error will be detected during contextual analysis (when type checking occurs).

The syntax of textual programming languages is usually defined using a combination of regular expressions (for lexical structure) and Backus–Naur form (a metalanguage for grammatical structure) to inductively specify syntactic categories (nonterminal) and terminal symbols.

Different but equivalent phrase grammars yield different parse trees, though the underlying language (set of valid documents) is the same.

Below is a simple grammar, defined using the notation of regular expressions and Extended Backus–Naur form.

It describes the syntax of S-expressions, a data syntax of the programming language Lisp, which defines productions for the syntactic categories expression, atom, number, symbol, and list: This grammar specifies the following: Here the decimal digits, upper- and lower-case characters, and parentheses are terminal symbols.

For example, in Perl it is possible to execute code during parsing using a BEGIN statement, and Perl function prototypes may alter the syntactic interpretation, and possibly even the syntactic validity of the remaining code.

The meaning given to a combination of symbols is handled by semantics (either formal or hard-coded in a reference implementation).

Using natural language as an example, it may not be possible to assign a meaning to a grammatically correct sentence or the sentence may be false: The following C language fragment is syntactically correct, but performs an operation that is not semantically defined (because p is a null pointer, the operations p->real and p->im have no meaning): As a simpler example, is syntactically valid, but not semantically defined, as it uses an uninitialized variable.