Article code: 4495825
Journal code: 1623812
Publication year: 2016
English article: 9-page PDF
Full-text version: free download
English title of the ISI article
An ensemble distance measure of k-mer and Natural Vector for the phylogenetic analysis of multiple-segmented viruses
Related topics
Life Sciences and Biotechnology; Agricultural and Biological Sciences; Agricultural and Biological Sciences (General)
Abstract


• We tested six distance combination methods that are true metrics, using the reference viral genomes to evaluate these combinations of the Natural Vector (NV) and k-mer. The Euclidean-type combination of the Natural Vector and 5-mer has the best accuracy.
• We applied the range of k used in Sims et al. (2009) and Wen et al. (2014), together with leave-one-out cross-validation, to find the k-mer size. If no known classification labels are available for cross-validation, we can choose the k corresponding to a relatively stable topology.
• Tested on 48 influenza A viruses, the proposed method yields a single-linkage tree with more reasonable taxonomic relationships than the trees obtained using NV or 5-mer alone.
• The Kruskal-Wallis test, Wilcoxon rank-sum test, and Bartlett test show that the proposed ensemble distance measure separates the new H7N9 and old avian H7N9 viruses well.
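
The k-selection step in the highlights can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that leave-one-out cross-validation is scored with a 1-nearest-neighbour classifier on a precomputed distance matrix, which the text above does not state. The names `loo_1nn_accuracy` and `choose_k`, and the toy matrices in the usage lines, are hypothetical.

```python
def loo_1nn_accuracy(dist, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    dist   -- square, symmetric distance matrix (list of lists)
    labels -- known class label of each genome, in the same order
    Assumption: 1-NN is the classifier; the source only says
    "leave-one-out cross validation".
    """
    n = len(labels)
    correct = 0
    for i in range(n):
        # Hold out genome i and find its closest remaining genome.
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: dist[i][j])
        correct += labels[nearest] == labels[i]
    return correct / n


def choose_k(dist_by_k, labels):
    """Return the k whose distance matrix gives the best LOO accuracy.

    dist_by_k maps each candidate k to its distance matrix.
    Ties are broken in favour of the smallest k.
    """
    return max(sorted(dist_by_k),
               key=lambda k: loo_1nn_accuracy(dist_by_k[k], labels))


# Toy usage with made-up distance matrices for k = 3 and k = 5.
labels = ["a", "a", "b", "b"]
separating = [[0, 1, 5, 5], [1, 0, 5, 5], [5, 5, 0, 1], [5, 5, 1, 0]]
confusing = [[0, 5, 1, 5], [5, 0, 5, 1], [1, 5, 0, 5], [5, 1, 5, 0]]
best_k = choose_k({3: confusing, 5: separating}, labels)
```

When no labels are available, this scoring cannot be used, and the highlights suggest choosing the k at which the tree topology becomes relatively stable instead.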

The Natural Vector combined with the Hausdorff distance has been successfully applied to classifying and clustering multiple-segmented viruses. In addition, k-mer methods yield promising results for global genome comparison. It is not known whether combining these two approaches can lead to more accurate results. The author proposes a method that combines the Hausdorff distances of the 5-mer counting vectors and the natural vectors, which achieves the best classification without cutting off any sample. When the proposed method is used to predict the taxonomic labels of the 2363 NCBI reference viral genomes, the accuracy rates are 96.95%, 94.37%, 99.41% and 93.82% for the Baltimore, family, subfamily, and genus labels, respectively. We further applied the proposed method to 48 isolates of influenza A H7N9 viruses, each of which has eight complete segments of nucleotide sequence. The single-linkage clustering trees and the statistical hypothesis testing results all indicate that the proposed ensemble distance measure can cluster viruses well using all of their genome segments.
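
To make the idea concrete, the sketch below computes one possible ensemble distance between two segmented genomes. It rests on several assumptions that the abstract does not spell out: the natural vector is taken to be the commonly used 12-dimensional form (count, mean position, and normalized second central moment for each of A, C, G, T), the Hausdorff distance is taken over Euclidean distances between per-segment vectors, and the "Euclidean type" combination is taken to be the square root of the sum of squares of the two Hausdorff distances, with no rescaling. All function names are hypothetical.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def kmer_count_vector(seq, k):
    """Counts of every k-mer over ACGT; windows with other symbols are skipped."""
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}
    counts = [0.0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1.0
    return counts


def natural_vector(seq):
    """Assumed 12-dimensional natural vector: counts, mean positions, and
    second central moments scaled by (n_base * n), positions 1-based."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in ALPHABET:
        pos = [i + 1 for i, c in enumerate(seq) if c == base]
        n_b = len(pos)
        mu = sum(pos) / n_b if n_b else 0.0
        d2 = sum((p - mu) ** 2 for p in pos) / (n_b * n) if n_b else 0.0
        counts.append(float(n_b))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def hausdorff(set_a, set_b):
    """Hausdorff distance between two finite sets of vectors (Euclidean base)."""
    def directed(xs, ys):
        return max(min(math.dist(x, y) for y in ys) for x in xs)
    return max(directed(set_a, set_b), directed(set_b, set_a))


def ensemble_distance(segments_a, segments_b, k=5):
    """Assumed Euclidean-type combination of the two Hausdorff distances."""
    d_kmer = hausdorff([kmer_count_vector(s, k) for s in segments_a],
                       [kmer_count_vector(s, k) for s in segments_b])
    d_nv = hausdorff([natural_vector(s) for s in segments_a],
                     [natural_vector(s) for s in segments_b])
    return math.hypot(d_kmer, d_nv)


# Toy usage: two made-up "viruses" with two short segments each.
virus_a = ["ACGTACGTAC", "GGGTTTAAAC"]
virus_b = ["ACGTACGTAA", "GGCTTTAAAC"]
d_ab = ensemble_distance(virus_a, virus_b)
```

Because each genome is represented as a set of per-segment vectors, no segment has to be dropped or concatenated, which matches the abstract's point about using all segments. Whether the published method rescales the two distances before combining them is not stated here, so the unweighted combination above should be read as a placeholder.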

Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Theoretical Biology - Volume 398, 7 June 2016, Pages 136–144