The configuration space of homologous proteins: A theoretical and practical framework to reduce the diversity of the protein sequence space after massive all-by-all sequence comparisons

Article ID	Journal	Published Year	Pages	File Type
425340	Future Generation Computer Systems	2007	18 Pages	PDF

Abstract

Most of the millions of virtual protein sequences deduced from genomic DNA, and the millions to come, will never be experimentally confirmed, nor will their functions be directly analyzed. Exploring the majority of the protein space therefore relies on our ability to extrapolate knowledge from characterized sequences to uncharacterized ones. In this paper we analyzed the large-scale comparisons of hundreds of thousands of protein sequences that were previously carried out using supercomputers or grid frameworks. Following these comparisons, pragmatic rules were used to reduce protein diversity, but none was based on a rigorous and robust framework. We examined how projecting sequences into the configuration space of homologous proteins (CSHP) could provide a theoretically robust, long-term practical solution for organizing the protein space. The CSHP can be constructed from the output of any all-by-all pairwise comparison in which Z-values were computed from Monte Carlo simulations. Protein diversity can then be reduced according to an evolutionary model that yields consistent phylogenetic clusters. The projection in the CSHP can be easily updated after sequence database updates, and the accuracy of the phylogenetic topology can be improved by refining sub-models. Clusters of homologous proteins can be represented as phylogenetic trees (TULIP trees). We showed that the CSHP projection can be used to process the outputs of previous massive comparison projects based on Z-value statistics, given minor corrections for uncollected low values, and we propose guidelines for future generations of massive protein sequence comparison projects.
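
To make the Z-value statistic mentioned above concrete, the following is a minimal sketch of one common formulation: the alignment score of a sequence pair is compared with the distribution of scores obtained after randomly shuffling one of the two sequences. The abstract does not specify the scoring scheme, the shuffling procedure, or the number of simulations used in the compared projects, so the toy scoring function, its parameters, and the names `local_alignment_score` and `z_value` are illustrative assumptions only.

```python
import random
import statistics


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman local alignment score with a linear gap penalty.

    Assumption: real projects use substitution matrices and affine gaps;
    this simple scheme only keeps the sketch self-contained.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Monte Carlo Z-value: (observed score - mean shuffled score) / std.

    Sequence b is shuffled n_shuffles times, which preserves its length
    and composition while destroying the residue order.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        letters = list(b)
        rng.shuffle(letters)
        shuffled_scores.append(local_alignment_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: every shuffle gives the same score.
        return 0.0
    return (observed - mean) / sd
```

Under these assumptions, a pair of identical sequences should yield a Z-value far above that of the shuffled background, whereas unrelated sequences should yield values near zero. An all-by-all comparison would apply `z_value` to every pair and keep the resulting matrix as the input for any later projection or clustering step.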

Keywords

Z-value; Protein sequence comparison; TULIP