Treebank

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure.

The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years.

The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.

It is important to clarify the distinction between the formal representation and the file format used to store the annotated data.

For example, the syntactic analysis for John loves Mary, shown in the figure, may be represented by simple labelled brackets in a text file, following the Penn Treebank notation. This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools.
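The difference between the tree itself and its stored form can be sketched in a few lines of Python. The bracketed string below is one plausible Penn Treebank-style rendering of John loves Mary; the tags (S, NP, VP, NNP, VBZ) are taken from the Penn Treebank tag set, but the exact bracketing is an assumption made for illustration. The helper names `tokenize`, `parse` and `leaves` are likewise hypothetical and do not belong to any particular toolkit.

```python
import re

# Assumed bracketing for illustration, in Penn Treebank-style notation.
BRACKETED = """
(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))
"""

def tokenize(text):
    # Split into "(", ")", and runs of other non-space characters.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse one bracketed constituent starting at tokens[pos].

    Returns (tree, next_pos). A tree is (label, children); each child
    is either another tree or a word string.
    """
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at token {pos}")
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1  # skip the closing ")"

def leaves(tree):
    # Collect the words of the sentence in left-to-right order.
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

tree, _ = parse(tokenize(BRACKETED))
print(tree[0])       # label of the top node
print(leaves(tree))  # the words recovered from the tree
```

The nested tuples are the formal representation; the bracketed text is only one file format for it, and the same tree could equally be serialised as XML or JSON.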

However, only by correcting and completing a corpus by hand is it possible to identify rules absent from the parser's knowledge base.

A semantic treebank is a collection of natural language sentences annotated with a meaning representation.

An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.
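As a rough illustration of what a shallow annotation of this kind records, the sketch below stores one verbal proposition for John loves Mary. The numbered role labels (ARG0, ARG1) follow PropBank's convention, but the roleset identifier `love.01`, the `Proposition` class and the `render` helper are assumptions introduced here and do not reproduce PropBank's actual file format.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    predicate: str                 # the verb as it appears in the sentence
    roleset: str                   # assumed roleset identifier
    arguments: dict = field(default_factory=dict)  # role label -> text span

def render(prop):
    # Produce a compact predicate-argument string for display.
    args = ", ".join(f"{role}={span}" for role, span in prop.arguments.items())
    return f"{prop.roleset}({args})"

prop = Proposition(
    predicate="loves",
    roleset="love.01",  # hypothetical identifier, for illustration only
    arguments={"ARG0": "John", "ARG1": "Mary"},
)

print(render(prop))
```

Only the verb and its arguments are annotated; nothing in the structure attempts to assign a logical form to the sentence as a whole.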

Most syntactic treebanks annotate variants of either phrase structure (left) or dependency structure (right).
Example phrase structure tree for John loves Mary
Hybrid constituency/dependency tree from the Quranic Arabic Corpus
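
To contrast with a phrase structure tree, a dependency analysis of John loves Mary can be sketched as a flat table of head links, with no phrasal nodes at all. The relation labels (`nsubj`, `obj`, `root`) are assumed here in the style of Universal Dependencies, and the function names are hypothetical.

```python
# Each row: (id, word, head_id, relation). A head_id of 0 marks the root.
DEPENDENCIES = [
    (1, "John", 2, "nsubj"),
    (2, "loves", 0, "root"),
    (3, "Mary", 2, "obj"),
]

def root_of(rows):
    # A well-formed dependency tree has exactly one root word.
    roots = [word for _, word, head, _ in rows if head == 0]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {len(roots)}")
    return roots[0]

def dependents_by_head(rows):
    # Map each head id to the (relation, dependent id) pairs below it.
    table = {}
    for word_id, _, head, relation in rows:
        if head != 0:
            table.setdefault(head, []).append((relation, word_id))
    return table

print(root_of(DEPENDENCIES))
print(dependents_by_head(DEPENDENCIES))
```

Every node in this structure is a word of the sentence, whereas a phrase structure tree adds labelled nodes such as NP and VP above the words.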