Predicting protein–RNA interaction amino acids using random forest based on submodularity subset selection

Article ID	Journal	Published Year	Pages	File Type
15095	Computational Biology and Chemistry	2014	7 Pages	PDF

Abstract

• We proposed a computational method for predicting protein–RNA binding sites that combines local and global features from protein sequence and uses submodularity subset selection.
• It achieved better performance than other state-of-the-art methods.
• The extracted global features showed very strong discriminative ability for identifying interaction sites.

Protein–RNA interaction plays a crucial role in many biological processes, such as protein synthesis, transcriptional and post-transcriptional regulation of gene expression, and disease pathogenesis. RNAs in particular function by binding to proteins. Identifying the binding interface region is especially useful for cellular pathway analysis and drug design. In this study, we propose a novel approach for identifying binding sites in proteins that not only integrates local and global features derived directly from protein sequence, but also constructs a balanced training dataset by sub-sampling based on submodularity subset selection. First, we extracted local and global features from protein sequence, such as evolutionary information and molecular weight. Second, non-interaction sites greatly outnumber interaction sites, which creates a sample imbalance problem and hence biases the machine learning model toward non-interaction sites. To better resolve this problem, instead of randomly sub-sampling the over-represented non-interaction sites as in previous work, we employed a novel sampling approach based on submodularity subset selection, which can select a more representative data subset. Finally, random forests were trained on the optimally selected training subsets to predict interaction sites. Our results show that the proposed method is very promising for predicting protein–RNA interaction residues: it achieved an accuracy of 0.863, which is better than other state-of-the-art methods. Furthermore, random forest feature importance analysis indicated that the extracted global features have very strong discriminative ability for identifying interaction residues.
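The sub-sampling step described above can be illustrated with a minimal sketch. The abstract does not state which submodular objective was used, so this example assumes a facility-location function with a Gaussian similarity, maximized by the standard greedy algorithm. The function names (`similarity`, `greedy_facility_location`, `balance_negatives`) and the toy feature vectors are hypothetical and serve only to show how a representative subset of the over-represented class might be chosen.

```python
import math


def similarity(a, b):
    """Gaussian similarity in (0, 1]. The kernel is an assumption, not from the paper."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance)


def greedy_facility_location(points, k):
    """Greedily pick up to k indices maximizing f(S) = sum_i max_{j in S} sim(i, j).

    f is monotone submodular, so each step adds the point with the largest
    marginal gain over the current selection.
    """
    n = len(points)
    sim = [[similarity(points[i], points[j]) for j in range(n)] for i in range(n)]
    best = [0.0] * n  # best[i]: similarity of point i to its closest selected point
    selected = []
    for _ in range(min(k, n)):
        top_gain, top_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves coverage of every point.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > top_gain:
                top_gain, top_j = gain, j
        selected.append(top_j)
        best = [max(best[i], sim[i][top_j]) for i in range(n)]
    return selected


def balance_negatives(negatives, positives):
    """Keep as many non-interaction samples as there are interaction samples."""
    chosen = greedy_facility_location(negatives, len(positives))
    return [negatives[i] for i in chosen]


# Toy feature vectors (hypothetical): six non-interaction sites in two clusters,
# two interaction sites.
toy_negatives = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_positives = [(2.0, 2.0), (3.0, 3.0)]
print(balance_negatives(toy_negatives, toy_positives))
```

Unlike random sub-sampling, the greedy selection favors points that cover regions of feature space not yet represented, so with the toy data it takes one non-interaction sample from each cluster. The balanced subset, together with all interaction samples, would then be passed to a random forest classifier for training.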


Keywords

Random forest