کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
392630 665145 2016 10 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Identification of tandem repeats over large-alphabet inputs
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر هوش مصنوعی
پیش نمایش صفحه اول مقاله
Identification of tandem repeats over large-alphabet inputs
چکیده انگلیسی


• Makes use of simple list structure instead of suffix trees.
• Algorithm finds all the primitive maximal tandem repeats of all sizes.
• Theoretically the algorithm is linear in space.
• In practice the algorithm is linear in time and independent of the size of alphabet.
• Application in stringology and bioinformatics.

A primitive tandem repeat α is a substring in string S if it can be expressed as two or more contiguous copies of β, where the base β cannot be expressed in terms of yet a shorter substring. Substring α is maximal if there is no copy of β to either its left or right. Tandem repeats (or arrays) are known to play an important role in biology, e.g. determining parentage. We present a deterministic algorithm that finds all the exact primitive maximal tandem arrays that occur in S, where at iteration k it discovers all the repeats with base length k. Our algorithm uses a simple list structure for all its operations. In theory, the algorithm scales well with the size of the alphabet. For strings over the alphabet Σ   it has a complexity of O(|S|+|Σ|)O(|S|+|Σ|) in space, and O(B|S|−B2/2)O(B|S|−B2/2) in time, where B is the length of the longest primitive base of a tandem repeat in S. Experimental results on real biological sequences, randomly generated sequences using large sized alphabets, and Fibonacci strings, show that in practice the algorithm has indeed a linear complexity, both in space and time.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Information Sciences - Volume 345, 1 June 2016, Pages 96–105
نویسندگان
,