Fast embedding methods for clustering tens of thousands of sequences

Article code	Journal code	Publication year	English article	Full text
15354	1406	2008	5-page PDF	Free download

Keywords

multiple sequence alignment; data embedding; clustering; edit distance

Related topics

Engineering and Basic Sciences, Chemical Engineering, Bioengineering

Abstract

Most sequence clustering methods require a full distance matrix to be computed between all pairs of sequences. This requires computer memory and time proportional to N² for N sequences. For small N, say up to 10 000 or so, this can be accomplished in reasonable times for sequences of moderate length. For very large N, however, this becomes increasingly prohibitive. In this paper, we have tested variations on a class of published embedding methods that have been designed for clustering large numbers of complex objects where the individual distance calculations are expensive. These methods involve embedding the sequences in a space where the similarities within a set of sequences can be closely approximated without having to compute all pair-wise distances. We show how this approach greatly reduces computation time and memory requirements for clustering large numbers of sequences and demonstrate the quality of the clusterings by benchmarking them as guide trees for multiple alignments. Source code is available on request from the authors.
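
As a rough illustration of the embedding idea described in the abstract, the following Python sketch maps each sequence to a short vector of edit distances to k reference sequences, so that only N·k distances are computed instead of N(N-1)/2. The random choice of reference sequences, the use of plain edit distance, and the function names (`edit_distance`, `embed`, `distance_counts`) are assumptions made for this example. It is not the authors' implementation, which is available from them on request.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embed(seqs, k, seed=0):
    """Map each sequence to a k-vector of distances to k reference sequences.

    Assumption: references are drawn at random from the input. This costs
    N * k distance computations rather than N * (N - 1) / 2.
    """
    rng = random.Random(seed)
    refs = rng.sample(seqs, k)
    return [[edit_distance(s, r) for r in refs] for s in seqs]


def distance_counts(n, k):
    """Return (pairs in a full distance matrix, distances for the embedding)."""
    return n * (n - 1) // 2, n * k


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
    for s, vec in zip(seqs, embed(seqs, k=2)):
        print(s, vec)
    print(distance_counts(20000, 50))
```

Because edit distance obeys the triangle inequality, each coordinate of two embedded vectors differs by at most the true edit distance between the two sequences, so similar sequences stay close in the embedded space. The resulting vectors can then be handed to any standard vector clustering method. For N = 20 000 and k = 50, the full matrix needs 199 990 000 pairwise distances, while this embedding needs 1 000 000.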

Publisher

Database: Elsevier - ScienceDirect
Journal: Computational Biology and Chemistry - Volume 32, Issue 4, August 2008, Pages 282–286

Authors

Gordon Blackshields, Mark Larkin, Iain M. Wallace, Andreas Wilm, Desmond G. Higgins
