Stemloc attempts to simultaneously predict and align the structure of RNA sequences at a lower time and space cost than previous methods with the same goal.
The resulting software implements constrained versions of the Sankoff algorithm by introducing both fold and alignment constraints, which reduces processor and memory usage and allows for larger RNA sequences to be analyzed on commodity hardware.
An algorithm published by David Sankoff in 1985 uses dynamic programming to simultaneously align and predict the structures of multiple RNA sequences.
This approach is computationally expensive, which motivated the development of more efficient RNA analysis tools such as Stemloc.
The initial goal of Stemloc was to reduce the time and space cost of simultaneous alignment and structure prediction of two RNA sequences by using a stochastic context-free grammar (SCFG) scoring scheme and by implementing constrained versions of the Sankoff algorithm.
Fold envelopes can be used to "prune" the search over secondary structures and determine the subsequences of two RNA sequences that can be considered in the algorithm.
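The pruning effect of a fold envelope can be illustrated with a minimal sketch. It is not Stemloc's implementation: it swaps in a much simpler single-sequence, Nussinov-style base-pair maximization, and the names `max_pairs`, `can_pair` and `envelope` are hypothetical. The envelope is modeled as a set of (i, j) subsequence endpoints that are allowed to pair; anything outside it is never considered.

```python
from functools import lru_cache


def can_pair(a, b):
    """Watson-Crick and G-U wobble pairs only (a simplifying assumption)."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})


def max_pairs(seq, envelope=None):
    """Maximum number of nested base pairs in seq.

    envelope: hypothetical set of (i, j) index pairs that may be paired.
    None means unconstrained (every subsequence is considered).
    Minimum hairpin loop length is ignored to keep the sketch short.
    """

    @lru_cache(maxsize=None)
    def best(i, j):
        if j - i < 1:  # empty or single-base subsequence: nothing can pair
            return 0
        score = best(i, j - 1)  # case 1: position j is unpaired
        for k in range(i, j):  # case 2: position j pairs with some k
            if not can_pair(seq[k], seq[j]):
                continue
            if envelope is not None and (k, j) not in envelope:
                continue  # pruned by the fold envelope
            score = max(score, best(i, k - 1) + 1 + best(k + 1, j - 1))
        return score

    return best(0, len(seq) - 1)


full = max_pairs("GGGAAACCC")
pruned = max_pairs("GGGAAACCC", envelope={(0, 8), (1, 7)})
```

A tighter envelope means fewer candidate pairs are examined, at the risk of excluding the best structure; here the constrained run can no longer use the innermost G-C pair.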
Stemloc relies heavily on stochastic context-free grammars, which can be seen as a scoring scheme for the algorithm.
Because Sankoff's algorithm considers all possible folds and all possible alignments, it is accurate and thorough, but it takes a considerable amount of time to produce any output.
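To give a rough sense of scale, the pairwise Sankoff recursion is indexed by one subsequence from each input, so its table grows with the fourth power of the sequence length (and its running time with the sixth). The sketch below only counts table cells; the function names are hypothetical, and the envelope sizes in the example are made-up numbers chosen for illustration.

```python
def intervals(length):
    """Number of subsequences (i, j) with 0 <= i <= j <= length."""
    return (length + 1) * (length + 2) // 2


def pairwise_cells(len_x, len_y):
    """Cells in an unconstrained pairwise table: one per pair of subsequences."""
    return intervals(len_x) * intervals(len_y)


def constrained_cells(envelope_x_size, envelope_y_size):
    """Cells when each sequence is restricted to the subsequences in its envelope."""
    return envelope_x_size * envelope_y_size


unconstrained = pairwise_cells(100, 100)
# Hypothetical envelopes that keep 500 subsequences per sequence.
constrained = constrained_cells(500, 500)
print(unconstrained, constrained, unconstrained / constrained)
```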
The main idea of Stemloc is to let the user set a threshold on the number of folds and alignments that are sampled to create the envelopes.
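One way to picture such a threshold is sketched below. This is an assumption-laden illustration rather than Stemloc's actual procedure: it assumes candidate folds arrive as (score, base pairs) tuples, keeps the highest-scoring ones, and takes the union of their base pairs as the envelope. The function name `build_fold_envelope` and the parameter `max_folds` are hypothetical; a value of -1 is treated as "no limit", signaled here by returning `None`.

```python
def build_fold_envelope(scored_folds, max_folds):
    """Union of base pairs from the best-scoring candidate folds.

    scored_folds: list of (score, pairs), where pairs is an iterable of (i, j).
    max_folds:    how many candidates to keep; -1 means unlimited, in which
                  case None is returned to mean "no constraint at all".
    """
    if max_folds == -1:
        return None
    # Higher score is assumed to be better (e.g. a log-probability).
    ranked = sorted(scored_folds, key=lambda item: item[0], reverse=True)
    envelope = set()
    for _score, pairs in ranked[:max_folds]:
        envelope |= set(pairs)
    return envelope


candidates = [
    (-3.0, {(0, 8)}),
    (-1.0, {(1, 7)}),
    (-2.0, {(2, 6)}),
]
top_two = build_fold_envelope(candidates, 2)
unlimited = build_fold_envelope(candidates, -1)
```

A small limit gives a small envelope and a fast but more heavily constrained search; raising it trades speed for coverage.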
Using -1 removes the limit on the number of folds and alignments sampled, so using -1 for both parameters runs the full Sankoff algorithm on the input dataset. Another feature of Stemloc is its ability to parameterize probabilistic models such as stochastic context-free grammars from data.
Stemloc utilizes the Inside-Outside algorithm and stochastic context-free grammars to maximize the likelihood of a training set.
This is useful because the default parameters for Stemloc were trained on a selection of pairwise alignments with between 30% and 40% sequence identity from version 5.0 of the Rfam database.
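The likelihood that Inside-Outside training maximizes is computed by the Inside pass. The sketch below shows only that pass, for a toy single-sequence grammar in Chomsky normal form; Stemloc's own grammars are more elaborate, and the function name, rule encoding and example grammar here are all hypothetical. Full training would add the Outside pass and re-estimate the rule probabilities from expected rule counts.

```python
from collections import defaultdict


def inside_probability(tokens, binary_rules, terminal_rules, start="S"):
    """Probability that `start` derives `tokens` under a CNF SCFG.

    binary_rules:   {(lhs, left, right): prob} for rules lhs -> left right
    terminal_rules: {(lhs, terminal): prob}    for rules lhs -> terminal
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    # inside[(i, j)][A] = P(A derives tokens[i..j]), indices inclusive
    inside = defaultdict(lambda: defaultdict(float))
    for i, tok in enumerate(tokens):
        for (lhs, term), p in terminal_rules.items():
            if term == tok:
                inside[(i, i)][lhs] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right child
                for (lhs, left, right), p in binary_rules.items():
                    left_p = inside[(i, k)].get(left, 0.0)
                    right_p = inside[(k + 1, j)].get(right, 0.0)
                    if left_p and right_p:
                        inside[(i, j)][lhs] += p * left_p * right_p
    return inside[(0, n - 1)].get(start, 0.0)


# Toy grammar: S -> S S (0.3) | 'a' (0.7)
binary = {("S", "S", "S"): 0.3}
terminal = {("S", "a"): 0.7}
likelihood = inside_probability("aaa", binary, terminal)
```

For the string "aaa" the grammar has two parse trees, and the Inside pass sums over both without enumerating them.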