Efficient tools for comparative substring analysis

کد مقاله	کد نشریه	سال انتشار	مقاله انگلیسی	نسخه تمام متن
24216	43505	2010	7 صفحه PDF	دانلود رایگان

عنوان انگلیسی مقاله ISI

دانلود مقاله + سفارش ترجمه

دانلود مقاله ISI انگلیسی

رایگان برای ایرانیان

کلمات کلیدی

Suffix tree - درخت سلفی Phylogeny - فیلوژنی

موضوعات مرتبط

مهندسی و علوم پایه مهندسی شیمی بیو مهندسی (مهندسی زیستی)

پیش نمایش صفحه اول مقاله

Efficient tools for comparative substring analysis

چکیده انگلیسی

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve allk-mers for anyk up to any preset m aximum length K (including K = ∞). Remarkably, although there may be Θ(L2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k ≤ K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K).A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper
• describes the details of this implementation—an implementation that allows the user to compute composition-based distances for a wide range of instances on data sets of unprecedented size which may be useful in assessing the validity of the approach and to fine-tune the identification of those values of k (or K) yielding the best separators and descriptors in correspondence with different inputs,
• indicates how the proposed algorithm can also be used for other tasks related to the identification and comparative analysis of highly over- or under-represented (sub)strings in given genomes, meta-genomes, or any other sequence families of interest (e.g., all proteins encoded by a given genome, all strings of non-coding or regulatory RNA, all introns, etc.),
• and thus conforms with the increasing need for radically new, fast, and massive techniques for comparative genome analysis.

ناشر

Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Biotechnology - Volume 149, Issue 3, 1 September 2010, Pages 120–126

نویسندگان

Alberto Apostolico, Olgert Denas, Andreas Dress,

علوم انسانی و هنر

فنی، مهندسی و علوم پایه

پزشکی و سلامت

بیو تکنولوژی

پذیرش سفارش ترجمه

Efficient tools for comparative substring analysis

دسترسی سریع

ارتباط

English Website