Article ID Journal Published Year Pages File Type
425340 Future Generation Computer Systems 2007 18 Pages PDF
Abstract

Most of the millions of virtual protein sequences deduced from genomic DNA, and the millions to come, will not be experimentally confirmed, neither their function directly analyzed. The exploration of the majority of the protein space relies on our ability to extrapolate the portion of knowledge on characterized sequences to unknown sequences. In this paper we analyzed the large scale comparisons of hundreds of thousands of protein sequences that have been previously carried out using the power of supercomputers or grid frameworks. Following these comparisons, pragmatic rules were used to reduce protein diversity, but none was based on a rigorous and robust framework. We examined how projection of sequences in the configuration space of homologous proteins (CSHP) could help in providing a theoretically robust and long-term practical solution to help organize the protein space. The CSHP can be constructed from the output of any all-by-all pair-wise comparison in which Z-values were computed after Monte Carlo simulations. Reduction of protein diversity can be carried out according to an evolutionary model raising consistent phylogenetic clusters. Projection in the CSHP can be easily updated after sequence database updates, and the accuracy of the phylogenetic topology can be upgraded by improving sub-models. Clusters of homologous proteins can be represented as phylogenetic trees (TULIP trees). In this paper, we showed that the CSHP projection can be used to process the outputs of previous massive comparison projects based on Z-value statistics, given minor corrections for uncollected low values and we propose guidelines for future generations of massive protein sequence comparison projects.

Related Topics
Physical Sciences and Engineering Computer Science Computational Theory and Mathematics
Authors
, , , ,