کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
24216 43505 2010 7 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Efficient tools for comparative substring analysis
موضوعات مرتبط
مهندسی و علوم پایه مهندسی شیمی بیو مهندسی (مهندسی زیستی)
پیش نمایش صفحه اول مقاله
Efficient tools for comparative substring analysis
چکیده انگلیسی

This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve allk-mers for anyk up to any preset m aximum length K (including K = ∞). Remarkably, although there may be Θ(L2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k ≤ K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K).A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper
• describes the details of this implementation—an implementation that allows the user to compute composition-based distances for a wide range of instances on data sets of unprecedented size which may be useful in assessing the validity of the approach and to fine-tune the identification of those values of k (or K) yielding the best separators and descriptors in correspondence with different inputs,
• indicates how the proposed algorithm can also be used for other tasks related to the identification and comparative analysis of highly over- or under-represented (sub)strings in given genomes, meta-genomes, or any other sequence families of interest (e.g., all proteins encoded by a given genome, all strings of non-coding or regulatory RNA, all introns, etc.),
• and thus conforms with the increasing need for radically new, fast, and massive techniques for comparative genome analysis.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Biotechnology - Volume 149, Issue 3, 1 September 2010, Pages 120–126
نویسندگان
, , ,