Article ID Journal Published Year Pages File Type
4948278 Neurocomputing 2016 27 Pages PDF
Abstract
Predicting the subcellular locations of proteins is a critical step toward understanding their functions. Numerous computational methods have recently been developed for protein subcellular localization prediction, and most of them rely on Gene Ontology (GO) information for feature representation. Although prior research has shown that GO information improves predictive performance, it produces an extremely high-dimensional feature space whose dimension keeps growing as the number of terms in the GO database increases. To address this issue, we propose a novel feature representation method that exploits sequence evolutionary information instead of GO information. Using the proposed method, we generate a comprehensive set of 828 features from three sources: physicochemical properties, the position-specific score matrix (PSSM), and the k-skip-n-gram model. By combining the proposed features with a multi-label ensemble classifier, we further develop a novel multi-label learning method named mGOF-loc. Results on an updated large-scale dataset covering 37 subcellular locations show that mGOF-loc outperforms existing methods. A web server implementing mGOF-loc is freely available at http://server.malab.cn/mGOF-loc/Index.html.
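As a rough illustration of the k-skip-n-gram idea named in the abstract, the sketch below computes k-skip bigram (n = 2) frequencies over the 20 standard amino acids. The abstract does not give the paper's exact formulation, so the function name `k_skip_bigram_features`, the choice n = 2, and the normalization by the total pair count are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_skip_bigram_features(sequence, k=1):
    """Return a 400-dimensional vector of residue-pair frequencies.

    Hypothetical helper: a pair (a, b) is counted whenever b follows a
    with at most k residues skipped in between. Counts are normalized
    by the total number of counted pairs (an assumed normalization).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = Counter()
    for gap in range(k + 1):  # gap = number of skipped residues
        for i in range(len(sequence) - gap - 1):
            a, b = sequence[i], sequence[i + gap + 1]
            # Ignore pairs containing non-standard residues such as 'X'.
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[a + b] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in pairs]


if __name__ == "__main__":
    features = k_skip_bigram_features("MKTAYIAKQR", k=2)
    print(len(features))
```

Under these assumptions the vector length is fixed at 400 regardless of sequence length, in contrast to GO-based representations whose dimension depends on the size of the GO database.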
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence