Suffix array

It is a data structure used in, among others, full-text indices, data-compression algorithms, and the field of bibliometrics.

additional space beyond the input string and the output suffix array.

is now defined to be an array of integers providing the starting positions of suffixes of

These strings are sorted (as in a paper dictionary), before their starting positions (integer indices) are saved in

=banana$ to be indexed: The text ends with the special sentinel letter $ that is unique and lexicographically smaller than any other character.

[1] Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms (e.g., compared to Ukkonen's algorithm) and improved cache locality.

the suffix array would therefore occupy about 16 times more memory than the genome itself.

and can be converted into a suffix array by traversing the tree depth-first also in

A naive approach to construct a suffix array is to use a comparison-based sorting algorithm.

More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related to each other.

The algorithm is also rather simple (< 100 LOC) and can be enhanced to simultaneously construct the LCP array.

A careful implementation by Yuta Mori[8] outperforms most other linear or super-linear construction approaches.

It runs in linear time and has successfully been used as the basis for parallel[10] and external memory[11] suffix array construction algorithms.

Recent work by Salson et al. (2010) proposes an algorithm for updating the suffix array of a text that has been edited instead of rebuilding a new suffix array from scratch.

, it appears to perform well in practice: experimental results from the authors showed that their implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion of a reasonable number of letters in the original text.

In practical open source work, a commonly used routine for suffix array construction was qsufsort, based on the 1999 Larsson-Sadakane algorithm.

[12] This routine has been superseded by Yuta Mori's DivSufSort, "the fastest known suffix sorting algorithm in main memory" as of 2017.

[16] The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern

time, given that a single suffix comparison needs to compare

Abouelhoda, Kurtz & Ohlebusch (2004) improve the bound even further and achieve a search time of

If this string ends in a special end-of-string character that is lexicographically smaller than all other character (i.e., $), then the order of the sorted rotated BWT matrix corresponds to the order of suffixes in a suffix array.

Suffix trees are powerful data structures that have wide application in areas of pattern and string matching, indexing and textual statistics.

However, it occupies a significant amount of space and thus has a drawback in many real-time applications that require processing a considerably large amount of data like genome analysis.

The node branching data structure for this tree is a linked list.

Enhanced suffix arrays are superior in terms of both space efficiency and time complexity and are easy to implement.

The time complexity for searching a pattern in an enhanced suffix array is O(m|Σ|).

The suffix array is composed of two arrays: For a suffix array of S, the lcp-interval associated with the corresponding node of suffix tree of S can be defined as: Interval [i,..j], 0 ≤ i ≤ j ≤ n is an lcp-interval of lcp-value, if 1. lcptab[i] < l, 2. lcptab[k] ≥ l for all i + 1 ≤ k ≤ j, 3. lcptab[k] = l for some i + 1 ≤ k ≤ j if i ≠ j and l = n − i + 1 if i = j, 4. lcptab[j + 1] < l. The length of the longest common prefix of pos[i − 1] and pos[i] is stored in lcp[i],where 2 ≤ i ≤ n. The lcp-interval portrays the same parent-child relationship as that among the associated nodes in the suffix tree of S.This shows that if the corresponding node of [i..j] is a child of the corresponding node of [k..l], a lcp-interval [i..j] is a child interval of another lcp-interval [k..l].

The child table cldtab is composed of three n arrays, up, down and nextlIndex.

The information about the edges of the corresponding suffix tree is stored and maintained by the up and down arrays.

The up, down and nextlIndex array are defined as follows: By performing a bottom-up traversal of the lcp-interval of the tree, the child table can be constructed in linear time.