کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
417987 681597 2016 30 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Mining approximate patterns with frequent locally optimal occurrences
ترجمه فارسی عنوان
یافتن الگوهای تقریبی با وقایع مکرر به طور محلی بهینه
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات
چکیده انگلیسی

We consider a frequent approximate pattern mining problem, in which interspersed repetitive regions are extracted from a given string. That is, we enumerate substrings that frequently match substrings of a given string locally and optimally. For this problem, we propose a new algorithm, in which candidate patterns are generated without duplication using the suffix tree of a given string. We further define a kk-gap-constrained setting, in which the number of gaps in the alignment between a pattern and an occurrence is limited to at most kk. Under this setting, we present memory-efficient algorithms, particularly a candidate-based version, which runs fast enough even over human chromosome sequences with more than 10 million nucleotides. We note that our problem and algorithms for strings can be directly extended to ordered labeled trees. In our experiments we used both randomly synthesized strings, in which corrupted similar substrings are embedded, and real data of human chromosome. The synthetic data experiments show that our proposed approach extracted embedded patterns correctly and time-efficiently. In real data experiments, we examined the centers of 100 clusters computed after grouping the patterns obtained by our kk-gap-constrained versions (k=0,1k=0,1 and 2) and the results revealed that the regions of their occurrences coincided with around a half of the regions automatically annotated as Alu sequences by a manually curated repeat sequence database.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Discrete Applied Mathematics - Volume 200, 19 February 2016, Pages 123–152
نویسندگان
, , , , ,