کد مقاله | کد نشریه | سال انتشار | مقاله انگلیسی | نسخه تمام متن |
---|---|---|---|---|
392630 | 665145 | 2016 | 10 صفحه PDF | دانلود رایگان |

• Makes use of simple list structure instead of suffix trees.
• Algorithm finds all the primitive maximal tandem repeats of all sizes.
• Theoretically the algorithm is linear in space.
• In practice the algorithm is linear in time and independent of the size of alphabet.
• Application in stringology and bioinformatics.
A primitive tandem repeat α is a substring in string S if it can be expressed as two or more contiguous copies of β, where the base β cannot be expressed in terms of yet a shorter substring. Substring α is maximal if there is no copy of β to either its left or right. Tandem repeats (or arrays) are known to play an important role in biology, e.g. determining parentage. We present a deterministic algorithm that finds all the exact primitive maximal tandem arrays that occur in S, where at iteration k it discovers all the repeats with base length k. Our algorithm uses a simple list structure for all its operations. In theory, the algorithm scales well with the size of the alphabet. For strings over the alphabet Σ it has a complexity of O(|S|+|Σ|)O(|S|+|Σ|) in space, and O(B|S|−B2/2)O(B|S|−B2/2) in time, where B is the length of the longest primitive base of a tandem repeat in S. Experimental results on real biological sequences, randomly generated sequences using large sized alphabets, and Fibonacci strings, show that in practice the algorithm has indeed a linear complexity, both in space and time.
Journal: Information Sciences - Volume 345, 1 June 2016, Pages 96–105