Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
434357 | Theoretical Computer Science | 2014 | 8 Pages |
A tandem repeat in a string is a sequence of two or more contiguous, approximate copies of a pattern. Finding an efficient, deterministic algorithm to perform an exhaustive search for tandem repeats in a string is an important and practical problem, due to the relevance of tandem repeats in several areas, including human identity testing, disease diagnosis, sequence homology, and population studies.In this paper, we present an O(nklog2k+Occ) algorithm to find all approximate tandem repeats within a sequence of length n, allowing at most k insertions, deletions and mismatches in each repeat. Our algorithm utilizes the Lempel–Ziv factorization which was previously used in algorithms that locate exact tandem repeats, and algorithms that locate tandem repeats with only mismatches. The LZ framework is combined with speedups for calculating the edit distance, achieving a new and efficient exhaustive search for finding tandem repeats in a string.