PMBC: Pattern mining from biological sequences with wildcard constraints

کد مقاله	کد نشریه	سال انتشار	مقاله انگلیسی	نسخه تمام متن
505143	864477	2013	12 صفحه PDF	دانلود رایگان

عنوان انگلیسی مقاله ISI

دانلود مقاله + سفارش ترجمه

دانلود مقاله ISI انگلیسی

رایگان برای ایرانیان

کلمات کلیدی

DNA - DNA یا اسید دزوکسی ریبونوکلئیک biological sequences - توالی بیولوژیکی Data mining - داده‌کاوی Protein - پروتئین Pattern discovery - کشف الگو

موضوعات مرتبط

مهندسی و علوم پایه مهندسی کامپیوتر نرم افزارهای علوم کامپیوتر

پیش نمایش صفحه اول مقاله

PMBC: Pattern mining from biological sequences with wildcard constraints

چکیده انگلیسی

Patterns/subsequences frequently appearing in sequences provide essential knowledge for domain experts, such as molecular biologists, to discover rules or patterns hidden behind the data. Due to the inherent complex nature of the biological data, patterns rarely exactly reproduce and repeat themselves, but rather appear with a slightly different form in each of its appearances. A gap constraint (In this paper, a gap constraint (also referred to as a wildcard) is a character that can be substituted for any character predefined in an alphabet.) provides flexibility for users to capture useful patterns even if their appearances vary in the sequences. In order to find patterns, existing tools require users to explicitly specify gap constraints beforehand. In reality, it is often nontrivial or time-consuming for users to provide proper gap constraint values. In addition, a change made to the gap values may give completely different results, and require a separate time-consuming re-mining procedure. Therefore, it is desirable to automatically and efficiently find patterns without involving user-specified gap requirements. In this paper, we study the problem of frequent pattern mining without user-specified gap constraints and propose PMBC (namely P̲atternM̲ining from B̲iological sequences with wildcard C onstraints) to solve the problem. Given a sequence and a support threshold value (i.e. pattern frequency threshold), PMBC intends to discover all subsequences with their support values equal to or greater than the given threshold value. The frequent subsequences then form patterns later on. Two heuristic methods (one-way vs. two-way scans) are proposed to discover frequent subsequences and estimate their frequency in the sequences. Experimental results on both synthetic and real-world DNA sequences demonstrate the performance of both methods for frequent pattern mining and pattern frequency estimation.

ناشر

Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Computers in Biology and Medicine - Volume 43, Issue 5, 1 June 2013, Pages 481–492

نویسندگان

Xindong Wu, Xingquan Zhu, Yu He, Abdullah N. Arslan,

علوم انسانی و هنر

فنی، مهندسی و علوم پایه

پزشکی و سلامت

بیو تکنولوژی

پذیرش سفارش ترجمه

PMBC: Pattern mining from biological sequences with wildcard constraints

دسترسی سریع

ارتباط

English Website