Spaced seed

In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions.

Spaced seeds are a straightforward modification of the earliest heuristic-based alignment efforts, one that allows for minor differences between the sequences of interest.

Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.
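As a small illustration of these notations, the sketch below converts a seed written with 1s (relevant) and 0s (irrelevant) into the pound-sign form. The helper name `render_seed` and the 1/0 input convention are assumptions made for this example, not part of any standard tool.

```python
def render_seed(pattern: str, relevant: str = "#", irrelevant: str = "-") -> str:
    """Render a 1/0 seed pattern with display symbols (hypothetical helper)."""
    if set(pattern) - {"0", "1"}:
        raise ValueError("pattern must contain only '0' and '1'")
    # Relevant positions become `relevant`, all others become `irrelevant`.
    return "".join(relevant if c == "1" else irrelevant for c in pattern)


print(render_seed("1111001111"))                  # ####--####
print(render_seed("1111001111", irrelevant="*"))  # ####**####
```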

Due to a number of functional and evolutionary constraints, nucleic acid sequences between individuals tend to be highly conserved, with the typical difference between two human genomes estimated on the order of 0.6% (or around 20 million base pairs).

1111001111

Upon visual inspection, it is easy to see that there is a mismatch between the two sequences at the fifth and sixth base positions (the positions marked 0 in the pattern above).

In a non-spaced model, this putative match would be ignored if a seed size greater than 4 is specified.
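The effect of seed size on a contiguous seed can be checked with a short sketch. The two 10-base sequences used here are hypothetical, constructed only so that they differ at the fifth and sixth positions; the function names are likewise illustrative.

```python
def longest_exact_run(a: str, b: str) -> int:
    """Longest run of identical bases in two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0  # a mismatch resets the current run
        best = max(best, run)
    return best


def contiguous_seed_hits(a: str, b: str, seed_size: int) -> bool:
    """A contiguous seed hits only if `seed_size` consecutive bases match."""
    return longest_exact_run(a, b) >= seed_size


# Hypothetical sequences differing at positions 5 and 6 (1-based).
seq_a = "CTAAGTCACG"
seq_b = "CTAACACACG"

print(contiguous_seed_hits(seq_a, seq_b, 4))  # True
print(contiguous_seed_hits(seq_a, seq_b, 5))  # False
```

With these inputs the longest exact run is 4 bases on either side of the mismatch, so any contiguous seed longer than 4 fails to register a hit.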

A spaced seed such as 1111001111 could be used to effectively zero-weight the mismatch sites, treating the sequences as the same for the purposes of hit identification.
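A minimal sketch of this zero-weighting, assuming the seed is written as a string of 1s (compared) and 0s (ignored). The sequences are hypothetical examples that differ only at the fifth and sixth positions, and `spaced_seed_match` is an illustrative name.

```python
def spaced_seed_match(a: str, b: str, seed: str) -> bool:
    """True if windows a and b agree at every position where the seed has '1'."""
    if not (len(a) == len(b) == len(seed)):
        raise ValueError("windows and seed must have equal length")
    # Positions where the seed has '0' are skipped, i.e. given zero weight.
    return all(x == y for x, y, s in zip(a, b, seed) if s == "1")


# Hypothetical sequences differing at positions 5 and 6 (1-based).
seq_a = "CTAAGTCACG"
seq_b = "CTAACACACG"

print(spaced_seed_match(seq_a, seq_b, "1111001111"))  # True
print(spaced_seed_match(seq_a, seq_b, "1111111111"))  # False
```

The all-ones seed behaves like a contiguous seed of the same length and rejects the pair, while the spaced seed accepts it.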

[6] In 1995, a similar concept was used in approximate string matching, where "gapped tuples" of positions in a sequence were explored to identify common substrings between a large text and a query.[7]

The term "shape" was used in a 2001 paper on gapped q-grams, where it refers to a set of relevant positions in a substring.[8] Soon after, in 2002, PatternHunter introduced the "spaced model", proposed as an improvement upon the consecutive seeds used in BLAST,[1] which was ultimately adopted by newer versions of BLAST.

Finally, in 2003, PatternHunter II settled on the term "spaced seed" to refer to the approach used in PatternHunter.[9]

Popular alignment algorithms such as BLAST and MegaBLAST use a non-spaced model, where the entire length of the seed is made of exact matches.[10]

Thus, any mismatching base pair along the length of the seed will result in the program ignoring the potential hit.

The type of seed model used for sequence alignment can affect the processing time and memory usage when doing large-scale homology searches – two considerations that have been central in the development of modern homology search algorithms.

[11][2][12][13] Ideally, this first step would find all relevant locations in the target, so sensitivity is prioritized. Because an exhaustive search is computationally intensive, however, many popular algorithms (such as the earlier implementations of BLAST and FASTA) use heuristics to "short-cut" exploring all locations, ultimately missing many hits but running relatively quickly.
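One common way to take such a shortcut is to index the target by the bases at the seed's relevant positions and then look up each query window, rather than comparing every pair of locations. The sketch below illustrates that idea only; it is not the implementation used by BLAST, FASTA, or PatternHunter, and the function names and sequences are hypothetical.

```python
from collections import defaultdict


def seed_key(window: str, seed: str) -> str:
    """Keep only the bases at positions where the seed has '1'."""
    return "".join(base for base, s in zip(window, seed) if s == "1")


def build_index(target: str, seed: str) -> dict:
    """Map each seed key to the target offsets where it occurs."""
    index = defaultdict(list)
    span = len(seed)
    for i in range(len(target) - span + 1):
        index[seed_key(target[i:i + span], seed)].append(i)
    return index


def find_hits(query: str, target: str, seed: str) -> list:
    """Return (query_offset, target_offset) pairs whose windows agree on the seed."""
    index = build_index(target, seed)
    span = len(seed)
    hits = []
    for j in range(len(query) - span + 1):
        for i in index.get(seed_key(query[j:j + span], seed), []):
            hits.append((j, i))
    return hits


# Hypothetical data: the query differs from target[2:12] at two adjacent bases.
target = "GGCTAACACACGTT"
query = "CTAAGTCACG"

print(find_hits(query, target, "1111001111"))  # [(0, 2)]
print(find_hits(query, target, "1111111111"))  # []
```

Each hit would normally be passed to a slower extension step; the seed only decides which locations are worth that effort.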

A variation of spaced seeds with a single contiguous gap has been used in de novo sequence assembly.

Thus, to circumvent the problem with memory usage, the less-important middle part (covered by the gap) is ignored.

This approach has the additional advantage, as in other uses of spaced seeds, of taking into account any sequencing errors that may have occurred in the gap area.
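The single-gap variant can be sketched as a key made of the two flanks of a window, with the middle discarded. The flank and gap lengths, the reads, and the name `gapped_key` are all assumptions chosen for illustration and do not reflect any particular assembler.

```python
def gapped_key(window: str, flank: int, gap: int) -> tuple:
    """Return the (left, right) flanks of a window, ignoring the middle `gap` bases."""
    if len(window) != 2 * flank + gap:
        raise ValueError("window length must equal 2 * flank + gap")
    return window[:flank], window[flank + gap:]


# Hypothetical 14-base windows: 5-base flanks around a 4-base gap.
read_a = "ACGTACTTTGCAGT"
read_b = "ACGTACATTGCAGT"  # differs from read_a by one base inside the gap

print(gapped_key(read_a, flank=5, gap=4))  # ('ACGTA', 'GCAGT')
print(gapped_key(read_a, 5, 4) == gapped_key(read_b, 5, 4))  # True
```

Only 10 of the 14 bases are stored per key in this example, and a difference confined to the gap, such as a sequencing error, does not change the key.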

PatternHunter II, in 2003, demonstrated that this approach could offer higher sensitivity than BLAST while maintaining similar speed.