Probabilistic context-free grammar

In RNA secondary structure prediction variants of the Cocke–Younger–Kasami (CYK) algorithm provide more efficient alternatives to grammar parsing than pushdown automata.

The Inside-Outside algorithm is used in model parametrization to estimate prior frequencies observed from training sequences in the case of RNAs.

Dynamic programming variants of the CYK algorithm find the Viterbi parse of a RNA sequence for a PCFG model.

Context-free grammars are represented as a set of rules inspired from attempts to model natural languages.

Probabilistic grammars circumvent these problems by ranking various productions on frequency weights, resulting in a "most likely" (winner-take-all) interpretation.

As usage patterns are altered in diachronic shifts, these probabilistic rules can be re-learned, thus updating the grammar.

CFGs when contrasted with PCFGs are not applicable to RNA structure prediction because while they incorporate sequence-structure relationship they lack the scoring metrics that reveal a sequence structural potential [6] A weighted context-free grammar (WCFG) is a more general category of context-free grammar, where each production has a numeric weight associated with it.

[11][12][13][14][15] Energy minimization[16][17] and PCFG provide ways of predicting RNA secondary structure with comparable performance.

PCFG model parameters are directly derived from frequencies of different features observed in databases of RNA structures [6] rather than by experimental determination as is the case with energy minimization methods.

Therefore, building grammar to model the behavior of base-pairs and single-stranded regions starts with exploring features of structural multiple sequence alignment of related RNAs.

[1] The above grammar generates a string in an outside-in fashion, that is the basepair on the furthest extremes of the terminal is derived first.

is derived by first generating the distal a's on both sides before moving inwards: A PCFG model extendibility allows constraining structure prediction by incorporating expectations about different features of an RNA .

[6] However incorporation of too much information may increase PCFG space and memory complexity and it is desirable that a PCFG-based model be as simple as possible.

For example, Pfold is used in secondary structure prediction from a group of related RNA sequences,[20] covariance models are used in searching databases for homologous sequences and RNA annotation and classification,[11][24] RNApromo, CMFinder and TEISER are used in finding stable structural motifs in RNAs.

Each feature to be modeled has a production rule that is assigned a probability estimated from a training set of RNA structures.

In addition, the PCFG itself can be incorporated into probabilistic models that consider RNA evolutionary history or search homologous sequences in databases.

[21] A summary of general steps for utilizing PCFGs in various scenarios: Several algorithms dealing with aspects of PCFG based probabilistic models in RNA structure prediction exist.

It is possible to reestimate the PCFG algorithm by finding the expected number of times a state is used in a derivation through summing all the products of α and β divided by the probability for a sequence x given the model

It is also possible to find the expected number of times a production rule is used by an expectation-maximization that utilizes the values of α and β.

Covariance models (CMs) are a special type of PCFGs with applications in database searches for homologs, annotation and RNA classification.

Through CMs it is possible to build PCFG-based RNA profiles where related RNAs can be represented by a consensus secondary structure.

Since these scores are a function of sequence length a more discriminative measure to recover an optimum parse tree probability score-

[1] The KH-99 algorithm by Knudsen and Hein lays the basis of the Pfold approach to predicting RNA secondary structure.

[20] In this approach the parameterization requires evolutionary history information derived from an alignment tree in addition to probabilities of columns and mutations.

[20] For instance: Given the prior alignment frequencies of the data the most likely structure from the ensemble predicted by the grammar can then be computed by maximizing

[20] Whereas PCFGs have proved powerful tools for predicting RNA secondary structure, usage in the field of protein sequence analysis has been limited.

Indeed, the size of the amino acid alphabet and the variety of interactions seen in proteins make grammar inference much more challenging.

[39] As a consequence, most applications of formal language theory to protein analysis have been mainly restricted to the production of grammars of lower expressive power to model simple functional patterns based on local interactions.

[40][41] Since protein structures commonly display higher-order dependencies including nested and crossing relationships, they clearly exceed the capabilities of any CFG.

[39] Still, development of PCFGs allows expressing some of those dependencies and providing the ability to model a wider range of protein patterns.