Suffix automaton

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings.

is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any deterministic acyclic finite state automaton.

They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string

The concept of suffix automaton was introduced in 1983[1] by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner,[2] Vaughan Pratt[3] and Anatol Slissenko.

[4] In their initial work, Blumer et al. showed a suffix automaton built for the string

[5] In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm[2] while building a suffix tree of the string

[7] In 1997, Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction.

[1] In 2001, Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a trie.

[8] Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular:[9] Formally, deterministic finite automaton is determined by 5-tuple

It allows to formulate the following lemma defining a bijection between the right context of the word and the set of right positions of its occurrences in

Trie is said to recognize a set of words defined by paths from its root to final vertices.

In this way prefix trees are a special kind of deterministic finite automata if you perceive its root as an initial vertex.

[14] Initially the automaton only consists of a single state corresponding to the empty word, then characters of the string are added one by one and the automaton is rebuilt on each step incrementally.

[14] Given the correspondence between states of the suffix automaton and vertices of the suffix tree, it is possible to find out the second state that may possibly split after a new character is appended.

[14] Besides suffix links it is also needed to define final states of the automaton.

[19] Theoretical results above lead to the following algorithm that takes character

[19] Complexity of the algorithm may vary depending on the underlying structure used to store transitions of the automaton.

and it is only increased by one between iterations of appending new letters that suggest total complexity is at most linear as well.

[20] Similar transforms are possible in both directions to switch between the suffix automaton of

[18] Other than this several generalizations were developed to construct an automaton for the set of strings given by trie,[8] compacted suffix automation (CDAWG),[7] to maintain the structure of the automaton on the sliding window,[21] and to construct it in a bidirectional way, supporting the insertion of a characters to both the beginning and the end of the string.

which defines the set of words recognized by the same state of compacted automaton.

, which highlights the fact that a compacted automaton may be obtained by both gluing suffix tree vertices equivalent via

Algorithm suggested by Mohri mainly repeats the generic algorithm for building automaton of several strings but instead of growing words one by one, it traverses the trie in a breadth-first search order and append new characters as it meet them in the traversal, which guarantees amortized linear complexity.

[26] Some compression algorithms, such as LZ77 and RLE may benefit from storing suffix automaton or similar structure not for the whole string but for only last

In 1985, Janet Blumer developed an algorithm to maintain a suffix automaton on a sliding window of size

A variation of McCreight's suffix tree construction algorithm for this task was suggested in 1989 by Edward Fiala and Daniel Greene;[28] several years later a similar result was obtained with the variation of Ukkonen's algorithm by Jesper Larsson.

[31] One way to overcome this obstacle is to allow window width to vary a bit while staying

The window for which suffix automaton is built in this algorithm is not guaranteed to be of length

[33] Suffix automata are also used in data compression,[35] music retrieval[36][37] and matching on genome sequences.

Anselm Blumer with a drawing of generalized CDAWG for strings ababc and abcab
Relationship of the suffix trie, suffix tree, DAWG and CDAWG