Predicting rRNA-, RNA-, and DNA-binding proteins from primary structure with support vector machines

Article ID	Journal	Published Year	Pages	File Type
4499588	Journal of Theoretical Biology	2006	10 Pages	PDF

Abstract

In the post-genome era, the prediction of protein function is one of the most demanding tasks in bioinformatics. Machine learning methods, such as support vector machines (SVMs), greatly help to improve the classification of protein function. In this work, we combined SVMs with protein sequence amino acid composition and associated physicochemical properties to predict nucleic-acid-binding proteins. We developed binary classifiers for rRNA-, RNA-, and DNA-binding proteins, which play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to the rRNA-, RNA-, or DNA-binding protein class. Self-consistency and jackknife tests were performed on protein data sets in which the sequence identity was <25%. Test results show that the prediction accuracies of the rRNA-, RNA-, and DNA-binding SVMs are ∼84%, ∼78%, and ∼72%, respectively. Predictions were also performed on an ambiguous data set and a negative data set. The results demonstrate that the scores assigned to proteins in the ambiguous data set by the RNA- and DNA-binding SVM models were distributed around zero, while most proteins in the negative data set received negative scores from all three SVMs. The score distributions agree well with prior knowledge of those proteins and show the effectiveness of sequence-associated physicochemical properties in protein function prediction. The software is available from the author upon request.
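
As a rough illustration of two ingredients named in the abstract, the following Python sketch computes an amino acid composition vector from a sequence and runs a jackknife (leave-one-out) test. It is not the authors' software. All function names and toy sequences are hypothetical, the physicochemical descriptors are omitted, and a simple nearest-centroid rule stands in for the SVM so that the example needs only the standard library.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Return the 20-dimensional amino acid composition (fractions) of seq."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(X, y):
    """Stand-in classifier (NOT an SVM): store one centroid per class."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)}


def predict(model, x):
    """Assign x to the class whose centroid is nearest."""
    return min(model, key=lambda label: sq_dist(model[label], x))


def jackknife_accuracy(X, y):
    """Hold out each sample in turn, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(X)):
        model = train(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)


# Toy, made-up sequences for demonstration only (not from the paper's data sets).
toy_positive = ["KRKRKKRRGK", "RKKRRKAKRK", "KKRRKRKGRR"]
toy_negative = ["DEDEEDDEAD", "EEDDEDGEDE", "DDEEDEEDAE"]
X = [composition(s) for s in toy_positive + toy_negative]
y = [1] * len(toy_positive) + [0] * len(toy_negative)
print("jackknife accuracy on toy data:", jackknife_accuracy(X, y))
```

In a setup closer to the one described, the composition vector would be extended with physicochemical descriptors, and `train` and `predict` would wrap an SVM implementation, with one binary model per binding class.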

Keywords

Support vector machines (SVMs); DNA-binding protein; RNA-binding protein; Protein function prediction