Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
392630 | Information Sciences | 2016 | 10 Pages |
•Makes use of simple list structure instead of suffix trees.•Algorithm finds all the primitive maximal tandem repeats of all sizes.•Theoretically the algorithm is linear in space.•In practice the algorithm is linear in time and independent of the size of alphabet.•Application in stringology and bioinformatics.
A primitive tandem repeat α is a substring in string S if it can be expressed as two or more contiguous copies of β, where the base β cannot be expressed in terms of yet a shorter substring. Substring α is maximal if there is no copy of β to either its left or right. Tandem repeats (or arrays) are known to play an important role in biology, e.g. determining parentage. We present a deterministic algorithm that finds all the exact primitive maximal tandem arrays that occur in S, where at iteration k it discovers all the repeats with base length k. Our algorithm uses a simple list structure for all its operations. In theory, the algorithm scales well with the size of the alphabet. For strings over the alphabet Σ it has a complexity of O(|S|+|Σ|)O(|S|+|Σ|) in space, and O(B|S|−B2/2)O(B|S|−B2/2) in time, where B is the length of the longest primitive base of a tandem repeat in S. Experimental results on real biological sequences, randomly generated sequences using large sized alphabets, and Fibonacci strings, show that in practice the algorithm has indeed a linear complexity, both in space and time.