Grammar induction

Grammar induction (or grammatical inference)[1] is the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or alternatively as a finite-state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects.

More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

A more recent textbook is de la Higuera (2010),[1] which covers the theory of grammatical inference of regular languages and finite state automata.

D'Ulizia, Ferri and Grifoni[5] provide a survey that explores grammatical inference methods for natural languages.

The method proposed in Section 8.7 of Duda, Hart & Stork (2001) suggests successively guessing grammar rules (productions) and testing them against positive and negative observations.

The Duda, Hart & Stork (2001) text provides a simple example which nicely illustrates the process, but the feasibility of such an unguided trial-and-error approach for more substantial problems is dubious.
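As a concrete illustration of this idea, the following Python sketch (not the procedure of Duda, Hart & Stork; the rule pool and the `derives` and `trial_and_error` helpers are purely hypothetical) proposes subsets of a small pool of candidate productions and keeps the first subset that derives every positive observation and none of the negative ones.

```python
from itertools import combinations

def derives(rules, target):
    """Naive CFG membership test: bounded search over sentential forms,
    pruning any form longer than the target or whose terminal prefix
    already disagrees with it (assumes the grammar has no epsilon rules)."""
    seen, frontier = set(), [("S",)]
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > len(target):
            continue
        seen.add(form)
        i = next((k for k, s in enumerate(form) if s in rules), None)
        if i is None:                          # all terminals: compare directly
            if "".join(form) == target:
                return True
            continue
        if "".join(form[:i]) != target[:i]:    # terminal prefix mismatch
            continue
        for rhs in rules[form[i]]:             # expand the leftmost nonterminal
            frontier.append(form[:i] + rhs + form[i + 1:])
    return False

def trial_and_error(pool, positives, negatives, max_rules=3):
    """Successively guess subsets of the production pool and test them."""
    for r in range(1, max_rules + 1):
        for subset in combinations(pool, r):
            rules = {}
            for lhs, rhs in subset:
                rules.setdefault(lhs, []).append(rhs)
            if all(derives(rules, p) for p in positives) and \
               not any(derives(rules, n) for n in negatives):
                return rules
    return None

# Toy target language: a^n b^n with n >= 1
pool = [("S", ("a", "b")), ("S", ("a", "S", "b")),
        ("S", ("a", "S")), ("S", ("b", "S"))]
print(trial_and_error(pool, ["ab", "aabb", "aaabbb"], ["a", "ba", "aab"]))
```

On this toy input the search returns the grammar S → ab | aSb, which generates exactly the language a^n b^n; the combinatorial growth in the number of rule subsets to try is the feasibility problem mentioned above.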

Formal grammars can easily be represented as tree structures of production rules that can be subjected to evolutionary operators.

In the case of grammar induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language.

The fitness operator for the grammar is based upon some measure of how well it performs in parsing some group of sentences from the target language.
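A deliberately simplified sketch of such an evolutionary search is given below. Instead of full parse-tree genomes it uses flat lists of right-linear productions (so membership testing stays trivial), rule swapping as crossover, and a fitness score equal to the number of sample sentences accepted minus the number of counter-examples wrongly accepted; every identifier here (`accepts`, `evolve`, the rule pool) is an assumption made for the example, not part of any published system.

```python
import random

def accepts(rules, word):
    """Right-linear membership: rules map a nonterminal to (terminal,
    next nonterminal or None) pairs; None means the derivation ends."""
    states = {"S"}
    for i, ch in enumerate(word):
        last = (i == len(word) - 1)
        nxt = set()
        for nt in states:
            for term, succ in rules.get(nt, []):
                if term != ch:
                    continue
                if succ is None and last:
                    return True
                if succ is not None:
                    nxt.add(succ)
        states = nxt
    return False

def fitness(individual, positives, negatives):
    rules = {}
    for lhs, term, succ in individual:
        rules.setdefault(lhs, []).append((term, succ))
    return (sum(accepts(rules, p) for p in positives)
            - sum(accepts(rules, n) for n in negatives))

def crossover(a, b):
    """Swap a random slice of productions between two parents."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def evolve(pool, positives, negatives, pop_size=30, generations=60, genome=4):
    population = [[random.choice(pool) for _ in range(genome)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, positives, negatives),
                        reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            c, d = crossover(a, b)
            if random.random() < 0.2:          # occasional point mutation
                c[random.randrange(len(c))] = random.choice(pool)
            children.extend([c, d])
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, positives, negatives))

# Toy target: one or more 'a's followed by a single 'b'
pool = [("S", "a", "S"), ("S", "a", "A"), ("S", "b", None),
        ("A", "b", None), ("A", "a", "A"), ("S", "a", None)]
print(evolve(pool, ["ab", "aab", "aaab"], ["a", "b", "ba", "abb"]))
```

A typical run settles on productions equivalent to S → aS | aA, A → b, i.e. a grammar for a⁺b, although, being stochastic, the search is not guaranteed to find it every time.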

Among context-free grammar generating algorithms of this kind, some make a decision after each symbol read, while others first read the whole given symbol sequence and only then begin making decisions. A more recent approach is based on distributional learning.
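The second, whole-sequence style can be sketched in a few lines of Python loosely in the spirit of byte-pair-encoding-based grammar construction (an illustrative toy, not the exact algorithm of any particular tool): the most frequent adjacent pair of symbols is repeatedly replaced by a fresh nonterminal, yielding a straight-line grammar that derives exactly the input string.

```python
from collections import Counter

def induce_straight_line_grammar(text):
    sequence = list(text)
    rules = {}                       # nonterminal -> (left symbol, right symbol)
    next_id = 0
    while True:
        pairs = Counter(zip(sequence, sequence[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                # stop once no adjacent pair repeats
            break
        nonterminal = f"N{next_id}"
        next_id += 1
        rules[nonterminal] = pair
        rewritten, i = [], 0         # rewrite left to right, replacing the pair
        while i < len(sequence):
            if i + 1 < len(sequence) and (sequence[i], sequence[i + 1]) == pair:
                rewritten.append(nonterminal)
                i += 2
            else:
                rewritten.append(sequence[i])
                i += 1
        sequence = rewritten
    rules["S"] = tuple(sequence)     # the start rule expands to what remains
    return rules

print(induce_straight_line_grammar("abababcabab"))
```

For the input "abababcabab" this produces N0 → ab, N1 → N0 N0 and S → N1 N0 c N1, a grammar whose only derivable string is the input itself.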

Angluin gives a polynomial algorithm to compute, for a given input string set, all descriptive patterns in one variable x.[note 2]

To this end, she builds an automaton representing all possibly relevant patterns; using sophisticated arguments about word lengths, which rely on x being the only variable, the state count can be drastically reduced.[10]
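To make the notion concrete, the toy Python below treats a pattern as a string over the terminal alphabet plus the single variable x, where every occurrence of x must be replaced by the same nonempty string; it then brute-forces all short patterns consistent with a sample. The exhaustive search is an assumption made purely for illustration and is exponential, unlike Angluin's polynomial automaton-based construction.

```python
from itertools import product

def matches(pattern, word):
    """Is there a nonempty string that, substituted for every 'x' in
    `pattern`, turns the pattern into `word`?"""
    k = pattern.count("x")
    t = len(pattern) - k                 # number of terminal positions
    if k == 0:
        return pattern == word
    if (len(word) - t) % k or (len(word) - t) // k < 1:
        return False
    n = (len(word) - t) // k             # forced length of the substitution
    sub, i = None, 0
    for symbol in pattern:
        if symbol == "x":
            piece = word[i:i + n]
            if sub is None:
                sub = piece
            elif piece != sub:           # all occurrences must agree
                return False
            i += n
        else:
            if i >= len(word) or word[i] != symbol:
                return False
            i += 1
    return i == len(word)

def consistent_one_variable_patterns(samples):
    """Enumerate every pattern no longer than the shortest sample and keep
    those matching all samples (descriptive patterns are among them)."""
    alphabet = sorted(set("".join(samples))) + ["x"]
    shortest = min(len(s) for s in samples)
    found = []
    for length in range(1, shortest + 1):
        for symbols in product(alphabet, repeat=length):
            p = "".join(symbols)
            if "x" in p and all(matches(p, w) for w in samples):
                found.append(p)
    return found

print(consistent_one_variable_patterns(["aba", "abba", "acca"]))
```

For the sample {aba, abba, acca} this prints x, ax, xa and axa; the last of these is the most specific of the consistent patterns listed, which is what descriptiveness formalizes.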

Arimura et al. show that a language class obtained from limited unions of patterns can be learned in polynomial time.

Straight-line grammar (with start symbol ß) for the second sentence of the United States Declaration of Independence. Each blue character denotes a nonterminal symbol; they were obtained from a gzip compression of the sentence.