Planted motif search

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

The time complexities of most of the planted motif search algorithms depend exponentially on the alphabet size and l. The PMS problem was first introduced by Keich and Pevzner.

[1] The problem of identifying meaningful patterns (e.g., motifs) from biological data has been studied extensively since they play a vital role in understanding gene function, human disease, and may serve as therapeutic drug targets.

For example, if the input strings are GCGCGAT, CACGTGA, and CGGTGCC; l = 3 and d = 1, then GGT is a motif of interest.

For example, GAT is an instance of the motif GGT that occurs in the first input string.

Assume that S = {s1, s2, s3, ..., sn} is the given set of input strings from an alphabet Σ.

The scientific literature describes numerous algorithms for solving the PMS problem.

Examples of approximation (or heuristic) algorithms include Random Projection,[2] PatternBranching,[3] MULTIPROFILER,[1] CONSENSUS,[4] and ProfileBranching.

The algorithm projects these l-mers along k randomly chosen positions (for some appropriate value of k).

The key idea of this algorithm is to examine those buckets that have a large number of l-mers in them.

If A and B are two instances of the same motif in two different input strings, then the Hamming distance between A and B is at most 2d.

Two nodes u and v in G are connected by an edge if and only if the Hamming distance between u and v is at most 2d and they come from two different input strings.

(Bailey and Elkan, 1994)[18] employs expectation maximization algorithms while Gibbs sampling is used by (Lawrence et al., 1993).

In the last decade a series of algorithms with PMS as a prefix has been developed in the lab of Rajasekaran.

Let s1, s2, ..., sn be a given set of input strings each of length m. Let C be the collection of l-mers in s1.

In the first phase find all the (k, d) motifs present in all the input strings (for some appropriate value of k

In the second phase, look for (l-k+1) of these (k, d) motifs that occur starting from successive positions in each of the input strings.

PMSprune follows the same strategy as PMS0: For every l-mer y in s1, it generates the set of neighbors of y and, for each of them, checks whether this is a motif or not.

One of the key steps in the algorithm is a subroutine to compute the common d-neighborhood of three l-mers.

is formulated as an integer linear program (ILP) on ten variables.

Solving the ILP instances is done as a preprocessing step and the results are stored in a lookup table.

Algorithm PMS6[24] is an extension of PMS5 that improves the preprocessing step and also it uses efficient hashing techniques to store the lookup tables.

Shibdas Bandyopadhyay, Sartaj Sahni, Sanguthevar Rajasekaran, "PMS6: A fast algorithm for motif discovery," iccabs, pp.

[21] qPMSPrune exploits the following fact: If M is any (l, d, q)-motif of the input strings s1, s2, ..., sn, then there exists an i (with 1 ≤ i ≤ n – q + 1) and an l-mer

Specifically, it is based on the following observation: If M is any (l, d, q)-motif of the input strings s1, s2, ..., sn, then there exist 1 ≤ i ≠ j ≤ n and l-mer

PMS algorithms are typically tested on random benchmark data generated as follows: Twenty strings each of length 600 are generated randomly from the alphabet of interest.

For a given value of l, the instance (l, d) is called challenging if d is the smallest integer for which the expected number of (l, d)-motifs that occur by random chance (in addition to the planted one) is one or more.

The performance of PMS algorithms is customarily shown only for challenging instances.

Following is a table of time comparison of different PMS algorithms on the challenging instances of DNA sequences for the special case.

[25] In this table several algorithms have been compared: qPMSPrune,[21] qPMSPruneI,[25] Pampa,[26] Voting,[15] RISOTTO,[16] PMS5,[23] PMS6,[24] qPMS7.