Suffix tree

Suffix trees allow particularly fast implementations of many important string operations.
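To make this concrete, here is a minimal and deliberately naive (quadratic-time, quadratic-space) suffix trie sketch in Python. A real suffix tree compresses unary chains and is built in linear time, but substring queries walk both structures the same way, in time proportional to the pattern length. The nested-dict representation and function names are illustrative choices, not taken from any particular implementation.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return node if False else root

def contains(trie, pattern):
    """Substring test in O(len(pattern)): walk down one character at a time."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana$")
assert contains(trie, "nan")       # "nan" occurs in "banana"
assert not contains(trie, "nab")   # "nab" does not
```

The query cost depends only on the pattern, not on the indexed text, which is the source of the fast string operations mentioned above.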

Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.
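The linear-time solution finds the deepest internal node of a generalized suffix tree whose subtree contains leaves from both strings. The quadratic sketch below is not that algorithm; it uses the closely related sorted-suffix (suffix array) view instead, where the longest common substring is the longest common prefix of two adjacent sorted suffixes originating from different strings. Names are illustrative.

```python
def longest_common_substring(a, b):
    # Tag every suffix with the string it came from, then sort them together.
    suffixes = [(a[i:], 0) for i in range(len(a))] + \
               [(b[i:], 1) for i in range(len(b))]
    suffixes.sort()
    best = ""
    # The maximal cross-string LCP is always attained by an adjacent pair:
    # any suffix lying between a cross pair shares at least as long a prefix
    # with each end of that pair.
    for (s1, o1), (s2, o2) in zip(suffixes, suffixes[1:]):
        if o1 == o2:
            continue
        k = 0
        while k < min(len(s1), len(s2)) and s1[k] == s2[k]:
            k += 1
        if k > len(best):
            best = s1[:k]
    return best

assert longest_common_substring("xabxac", "aabca") == "ab"
```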

Weiner stored in his trie[3] the prefix identifier for each position, that is, the shortest string starting at that position that occurs only once in the entire text.

Weiner's Algorithm B maintains several auxiliary data structures, to achieve an overall run time linear in the size of the constructed trie.

Weiner's Algorithm C finally uses compressed tries to achieve linear overall storage size and run time.[5]

Donald Knuth subsequently characterized the latter as "Algorithm of the Year 1973", according to his student Vaughan Pratt.

[6] The textbook Aho, Hopcroft & Ullman (1974, Sect. 9.5) reproduced Weiner's results in a simplified and more elegant form, introducing the term position tree.

Although the suffix starting at a given position is usually longer than the prefix identifier, their path representations in a compressed trie do not differ in size.

On the other hand, McCreight could dispense with most of Weiner's auxiliary data structures; only suffix links remained.

Farach (1997) gave the first suffix tree construction algorithm that is optimal for all alphabets.

In particular, this is the first linear-time algorithm for strings drawn from an alphabet of integers in a polynomial range.

In such a case, the path spelling out bc will not end in a leaf, violating the fifth rule.
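This can be checked directly: a suffix that ends mid-path is exactly a suffix that is a proper prefix of another suffix. The sketch below assumes the text abcbc behind this example; the function name is illustrative.

```python
def non_leaf_suffixes(text):
    """Return the suffixes that are proper prefixes of another suffix.
    Without a terminator, these are the suffixes whose paths cannot
    end in a leaf of the suffix tree."""
    suffixes = [text[i:] for i in range(len(text))]
    return [s for s in suffixes
            if any(t != s and t.startswith(s) for t in suffixes)]

# "bc" is a prefix of "bcbc", and "c" is a prefix of "cbc":
assert non_leaf_suffixes("abcbc") == ["bc", "c"]
# Appending the unique terminator "$" makes every suffix end in a leaf:
assert non_leaf_suffixes("abcbc$") == []
```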

If an internal node represents the string χα, where χ is a single character and α is a (possibly empty) string, then it has a suffix link to the internal node representing α.

[9] For larger alphabets, the running time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes O(n log n) time.
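That sorting step amounts to replacing each distinct letter by its rank, so that the string is re-expressed over a small integer alphabet to which an integer-alphabet construction applies. A minimal sketch (illustrative names):

```python
def rank_compress(s):
    """Map each distinct character to its rank among the characters of s,
    yielding an equivalent string over the integer alphabet {0, ..., k-1}
    with k <= len(s)."""
    ranks = {ch: r for r, ch in enumerate(sorted(set(s)))}
    return [ranks[ch] for ch in s]

# 'a' -> 0, 'b' -> 1, 'n' -> 2:
assert rank_compress("banana") == [1, 0, 2, 0, 2, 0]
```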

The suffix tree can be prepared for constant-time lowest common ancestor retrieval between nodes in O(n) time.

[17] Suffix trees can be used to solve a large number of string problems that occur in text editing, free-text search, computational biology, and other application areas.

Variants of the LZ77 compression scheme, such as LZSS, use suffix trees to find repeated data in earlier parts of the input.

The total length of all the edge labels can be quadratic in the length of S, but each edge can be stored as the position and length of a substring of S, giving a total space usage of Θ(n) computer words.
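The edge representation can be sketched as follows: instead of storing a label outright, each edge keeps a (start, length) pair indexing into S, so a label of any length costs two integers. The class and field names are illustrative.

```python
S = "banana$"

class Edge:
    """An edge label stored as a window into the global text S."""
    def __init__(self, start, length):
        self.start = start    # index into S where the label begins
        self.length = length  # number of characters on this edge

    def label(self):
        # Materialize the label only when needed.
        return S[self.start:self.start + self.length]

# An edge labelled "ana" is stored as just two integers:
e = Edge(1, 3)
assert e.label() == "ana"
```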

The worst-case space usage of a suffix tree is seen with a Fibonacci word, giving the full 2n nodes.
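This can be explored empirically with the naive sketch below, which builds a suffix trie and then counts only the nodes that survive path compression (the root, branching nodes, and leaves), i.e. the nodes of the suffix tree. The Fibonacci-word recurrence and the counting approach are illustrative.

```python
def fib_word(n):
    """n-th Fibonacci word: a, ab, aba, abaab, abaababa, ..."""
    a, b = "a", "ab"
    for _ in range(n):
        a, b = b, b + a
    return a

def suffix_tree_nodes(text):
    """Count suffix tree nodes by building a naive suffix trie and keeping
    only the root, branching nodes, and leaves (unary nodes are compressed
    away)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    def count(node, is_root=False):
        kept = 1 if (is_root or len(node) != 1) else 0
        return kept + sum(count(child) for child in node.values())

    return count(root, True)

# banana$: root + internal nodes for "a", "na", "ana" + 7 leaves = 11 nodes.
assert suffix_tree_nodes("banana$") == 11
print(suffix_tree_nodes(fib_word(6) + "$"))  # node count for a Fibonacci word
```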

An important choice when implementing a suffix tree is how to represent the parent-child relationships between nodes.
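Two common options can be sketched as follows: a hash map per node gives expected constant-time child lookup but high per-node overhead, while a sorted array of (first character, child) pairs gives logarithmic lookup in the number of children with far less memory. Both classes are illustrative sketches, not a complete implementation.

```python
import bisect

class HashNode:
    """Children in a hash map: O(1) expected lookup, heavy per-node cost."""
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}  # first character of edge label -> child node

class ArrayNode:
    """Children in parallel sorted arrays: O(log k) lookup, compact."""
    __slots__ = ("keys", "kids")
    def __init__(self):
        self.keys, self.kids = [], []

    def add(self, ch, node):
        i = bisect.bisect_left(self.keys, ch)
        self.keys.insert(i, ch)
        self.kids.insert(i, node)

    def child(self, ch):
        i = bisect.bisect_left(self.keys, ch)
        if i < len(self.keys) and self.keys[i] == ch:
            return self.kids[i]
        return None

n = ArrayNode()
n.add("b", "left-subtree")
n.add("a", "right-subtree")
assert n.child("a") == "right-subtree"
assert n.child("c") is None
```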

The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to 20 times the memory size of the source text in good implementations.

[citation needed] Researchers have continued to find smaller indexing structures.

[32] Though linear, the memory usage of a suffix tree is significantly higher than the actual size of the sequence collection.

The algorithm by Farach-Colton, Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting.

[33] On the other hand, there has been practical work on constructing disk-based suffix trees that scale to a few gigabytes per hour.

TDD and TRELLIS scale up to the entire human genome, resulting in a disk-based suffix tree tens of gigabytes in size.

[36] DiGeST performs significantly better and is able to handle collections of sequences on the order of 6 GB in about 6 hours.

The most recent method, B2ST,[37] scales to handle inputs that do not fit in main memory.

ERA is a recent parallel suffix tree construction method that is significantly faster than earlier disk-based approaches.

ERA can index the entire human genome in 19 minutes on an 8-core desktop computer with 16 GB RAM.

Suffix tree for the text BANANA. Each substring is terminated with the special character $. The six paths from the root to the leaves (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the leaves give the start position of the corresponding suffix. Suffix links, drawn dashed, are used during construction.