Linguistic sequence complexity

Subsequent work improved the original algorithm described in Trifonov (1990),[1] without changing the essence of the linguistic complexity approach.

The complexity (C) of a sequence fragment of length RW can be calculated directly as the product of the vocabulary-usage measures (Ui):[2]

C = U1 U2 … Ui … UW

This formula differs from the original LC measure[1] in two respects: in the way the vocabulary usage Ui is calculated, and in that i does not range from 2 to N-1 but only up to W. This limitation on the range of Ui makes the algorithm substantially more efficient without loss of power.[2]
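The product of vocabulary-usage measures can be illustrated with a minimal Python sketch. This section does not state how Ui is computed, so the sketch assumes that Ui is the number of distinct words of size i observed in the fragment divided by the largest number of distinct words of that size the fragment could hold, taken here as min(K^i, RW - i + 1) for an alphabet of K letters. The function names, the alphabet_size parameter, the choice to start the product at i = 1, and passing W in as an argument are all assumptions made for illustration.

```python
from fractions import Fraction
from math import prod


def vocabulary_usage(seq: str, i: int, alphabet_size: int = 4) -> Fraction:
    """Assumed definition of Ui: observed / maximum possible distinct i-mers."""
    positions = len(seq) - i + 1  # number of places an i-mer can start
    if i < 1 or positions < 1:
        raise ValueError("word size must be between 1 and len(seq)")
    observed = {seq[p:p + i] for p in range(positions)}
    # The vocabulary is capped both by the alphabet and by the fragment length.
    max_vocabulary = min(alphabet_size ** i, positions)
    return Fraction(len(observed), max_vocabulary)


def complexity(seq: str, w: int, alphabet_size: int = 4) -> Fraction:
    """C as the product of Ui for i = 1..w (w is supplied by the caller)."""
    return prod(vocabulary_usage(seq, i, alphabet_size) for i in range(1, w + 1))


if __name__ == "__main__":
    for fragment in ("ACGGGAAGCTGATTCCA", "ACACACACACACACACA"):
        print(fragment, complexity(fragment, 3))
```

Exact fractions are used so that the individual Ui values stay readable; a float would serve equally well for large-scale scans.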

Another modified version was used in [5], wherein linguistic complexity (LC) is defined as the ratio of the number of distinct substrings of any length present in the string to the maximum possible number of substrings.

The maximum vocabulary over word sizes 1 to m can be calculated according to a simple formula.[5]
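The ratio-based LC can be sketched as follows. The formula for the maximum vocabulary is not reproduced in this section, so the sketch assumes one natural reading: for a sequence of length N over an alphabet of K letters, a word of size k can take at most K^k values and can start at no more than N - k + 1 positions, giving a maximum of min(K^k, N - k + 1) per word size, summed over k = 1..m. The symbols N and K, the function names, and the naive set-based counting of substrings are illustrative assumptions, not the algorithm of [5].

```python
from fractions import Fraction
from typing import Optional


def max_vocabulary(n: int, m: int, alphabet_size: int = 4) -> int:
    """Assumed maximum number of distinct words of sizes 1..m in a length-n string."""
    return sum(min(alphabet_size ** k, n - k + 1) for k in range(1, m + 1))


def observed_vocabulary(seq: str, m: int) -> int:
    """Number of distinct substrings of sizes 1..m actually present in seq."""
    return sum(
        len({seq[p:p + k] for p in range(len(seq) - k + 1)})
        for k in range(1, m + 1)
    )


def linguistic_complexity(
    seq: str, m: Optional[int] = None, alphabet_size: int = 4
) -> Fraction:
    """LC as observed vocabulary divided by the assumed maximum vocabulary."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    if m is None:
        m = len(seq)  # consider substrings of every length
    if not 1 <= m <= len(seq):
        raise ValueError("m must be between 1 and len(seq)")
    return Fraction(observed_vocabulary(seq, m), max_vocabulary(len(seq), m, alphabet_size))


if __name__ == "__main__":
    print(linguistic_complexity("ACGT"))
    print(linguistic_complexity("AAAA"))
```

The set-based counting takes time and memory that grow quadratically with sequence length when m equals the full length, so it is suitable only for short fragments or small m.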

This complexity calculation can be used to search for conserved regions between compared sequences and to detect low-complexity regions, including simple sequence repeats, imperfect direct or inverted repeats, polypurine and polypyrimidine triple-stranded DNA structures, and four-stranded structures (such as G-quadruplexes).
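One way such a search for low-complexity regions might be organized is a sliding-window scan, sketched below. The window length rw, the maximum word size w, the threshold of 0.2, and the per-window measure (a product of observed over maximum possible vocabulary for word sizes 1..w) are all assumptions chosen for illustration; none of these values comes from the cited work.

```python
from typing import List, Tuple


def window_complexity(window: str, w: int = 3, alphabet_size: int = 4) -> float:
    """Product over i = 1..w of observed / maximum possible distinct i-mers."""
    c = 1.0
    for i in range(1, w + 1):
        positions = len(window) - i + 1
        observed = {window[p:p + i] for p in range(positions)}
        c *= len(observed) / min(alphabet_size ** i, positions)
    return c


def low_complexity_windows(
    seq: str, rw: int = 17, w: int = 3, threshold: float = 0.2
) -> List[Tuple[int, float]]:
    """Return (start, complexity) for every window scoring below the threshold."""
    if rw < w:
        raise ValueError("window length must be at least the maximum word size")
    hits = []
    for start in range(len(seq) - rw + 1):
        c = window_complexity(seq[start:start + rw], w)
        if c < threshold:  # hypothetical cut-off, tune for real data
            hits.append((start, c))
    return hits


if __name__ == "__main__":
    sequence = "ACGGGAAGCTGATTCCA" + "ACACACACACACACACA"
    for start, c in low_complexity_windows(sequence):
        print(start, round(c, 4))
```

Overlapping hits would normally be merged into contiguous regions before reporting; that step is omitted to keep the sketch short.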