Article ID: 2816712
Journal: Gene
Published Year: 2014
Pages: 6 Pages
File Type: PDF
Abstract

• Sequence and structural features were used to identify human lincRNAs.
• A feature selection method named GA-SVM was performed to optimize features.
• The linc-SF was constructed based on the optimized features for predicting lincRNAs.

Long intergenic non-coding RNAs (lincRNAs) are a new class of non-coding RNAs that are closely associated with the occurrence and development of diseases. In previous studies, most lincRNAs were identified through next-generation sequencing. Because lincRNAs exhibit tissue-specific expression, lincRNA sets discovered in different studies reproduce poorly. In this study, we set lincRNA expression aside and used sequence, structural and protein-coding potential features as candidate features to construct a classifier that distinguishes lincRNAs from non-lincRNAs. The GA-SVM algorithm was applied to extract an optimized feature subset. In five-fold cross-validation, this optimized subset performed best among several feature subsets for the identification of human lincRNAs. The LincRNA Classifier based on Selected Features (linc-SF) was then constructed with a support vector machine (SVM) using the optimized feature subset. The performance of this classifier was further evaluated by predicting lincRNAs from two independent lincRNA sets. The recognition rates for the two sets were 100% and 99.8%, indicating that linc-SF is effective for the prediction of human lincRNAs.
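
The wrapper idea behind genetic-algorithm feature selection can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the GA settings, the feature list, or the SVM configuration, so the population size, mutation rate, subset-size penalty, toy data, and all function names here are hypothetical. To keep the sketch dependency-free, a nearest-centroid classifier stands in for the SVM as the fitness function; in a GA-SVM setup, the five-fold cross-validated accuracy of an SVM would take its place.

```python
import random

def make_toy_data(n_per_class=60, n_features=8, informative=(0, 1), seed=0):
    """Synthetic two-class data (hypothetical): only `informative` columns differ."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in informative:
                row[j] += 3.0 * label  # shift informative features for class 1
            X.append(row)
            y.append(label)
    return X, y

def centroid_cv_accuracy(X, y, mask, k=5):
    """k-fold CV accuracy of a nearest-centroid classifier (stand-in for an SVM)."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    idx = list(range(len(X)))
    random.Random(1).shuffle(idx)  # fixed shuffle keeps the fitness deterministic
    correct = 0
    for f in range(k):
        test = idx[f::k]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        centroids = {}
        for label in (0, 1):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
        for i in test:
            dist = {lab: sum((X[i][j] - c) ** 2 for j, c in zip(cols, cen))
                    for lab, cen in centroids.items()}
            correct += int(min(dist, key=dist.get) == y[i])
    return correct / len(X)

def ga_select(X, y, fitness, pop_size=20, generations=15, p_mut=0.1, seed=0):
    """Evolve boolean feature masks; return the best mask seen."""
    rng = random.Random(seed)
    n = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]

    def score(mask):
        # assumed penalty: 0.01 per selected feature, favouring compact subsets
        return fitness(X, y, mask) - 0.01 * sum(mask)

    best = max(pop, key=score)
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0][:]]              # elitism: keep the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        best = max(pop + [best], key=score)
    return best

X, y = make_toy_data()
mask = ga_select(X, y, centroid_cv_accuracy)
print("selected feature indices:", [j for j, keep in enumerate(mask) if keep])
print("5-fold CV accuracy:", centroid_cv_accuracy(X, y, mask))
```

Passing the fitness function as an argument keeps the search loop independent of the classifier, so an SVM-based cross-validation score could be substituted without changing `ga_select`.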
