Article code | Journal code | Publication year | English article | Full-text version
504827 | 864440 | 2016 | 10-page PDF | Free download
English title of the ISI article
Unsupervised learning assisted robust prediction of bioluminescent proteins
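Persian translation of the title
پیش بینی قوی از پروتئین بیولومینسنت به کمک آموزش بدون نظارت (literally: "Robust prediction of bioluminescent protein with the help of unsupervised learning")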
Keywords
Class imbalance; Training set diversification; Optimal class distribution; K-Means; SMOTE
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Science Software
English abstract


• Combination of unsupervised learning with SMOTE for imbalance learning problems.
• Effective handling of between-class and within-class imbalance.
• Diversification of the training set with optimal class distribution.
• Does not require evolutionary information for prediction.
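
The combination named in the highlights above can be illustrated with a minimal sketch, not the authors' exact procedure: cluster the minority class with K-Means, then add SMOTE-style interpolated samples inside each cluster until every cluster reaches the same size (within-class balance) and the minority total reaches a chosen fraction of the majority class (between-class balance). The function names `kmeans`, `smote_within`, and `balance_minority`, the parameters `ratio` and `k`, and the toy 2-D data are all assumptions made for illustration; a real pipeline would operate on sequence-derived feature vectors.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def smote_within(cluster, n_new, rng, k_neighbors=3):
    """SMOTE-style oversampling restricted to a single cluster."""
    if len(cluster) < 2:
        # Assumption: a singleton cluster cannot be interpolated, so it is duplicated.
        return [cluster[0]] * n_new
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(cluster))
        base = cluster[i]
        others = [p for j, p in enumerate(cluster) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k_neighbors]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


def balance_minority(minority, majority, ratio=1.0, k=2, seed=0):
    """Grow every minority cluster to the same target size, so that the
    minority total is about ratio * len(majority)."""
    rng = random.Random(seed)
    clusters = kmeans(minority, k, seed=seed)
    target = math.ceil(ratio * len(majority) / len(clusters))
    balanced = []
    for cl in clusters:
        balanced += cl + smote_within(cl, max(0, target - len(cl)), rng)
    return balanced


# Toy data (hypothetical): two unequal minority sub-groups and a larger majority class.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
            (0.4, 0.0), (0.0, 0.4), (5.0, 5.0), (5.2, 5.1)]
majority = [(float(i), 10.0) for i in range(16)]
balanced = balance_minority(minority, majority, ratio=1.0, k=2)
print(len(minority), "->", len(balanced), "minority samples;", len(majority), "majority samples")
```

Calling `balance_minority` with different `ratio` values is one simple way to generate training sets with different class distributions for comparison. Because new points are interpolated only between neighbours in the same cluster, small sub-groups of the minority class are enlarged without being blended into larger ones.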

Bioluminescence plays an important role in nature; for example, it is used for intracellular chemical signalling in bacteria. It also serves as a reagent in analytical research methods ranging from cellular imaging to gene expression analysis. However, identifying and annotating bioluminescent proteins is difficult because they share little sequence similarity. In this paper, we present a novel approach for within-class and between-class balancing, as well as diversification, of a training dataset by combining the unsupervised K-Means algorithm with the Synthetic Minority Oversampling Technique (SMOTE), so that the prediction model can reach its true performance. Further, we varied the ratio of positive to negative data in the training dataset to probe for the optimal class distribution, that is, the one that produces the best prediction accuracy. The appropriately balanced and diversified training set resulted in near-complete learning with greater generalization on the blind test datasets. The results strongly support the view that an optimal class distribution with a high degree of diversity is essential for near-perfect learning. Using random forests as the weak learners in boosting, trained on the optimally balanced and diversified training dataset, we achieved an overall accuracy of 95.3% in a tenfold cross-validation test, and an accuracy of 91.7%, a sensitivity of 89.3% and a specificity of 91.8% on a holdout test set. The general framework discussed here may also be applicable to other biological datasets affected by class imbalance and incomplete learning.
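
For readers unfamiliar with the three figures reported on the holdout set, the sketch below shows how accuracy, sensitivity, and specificity are conventionally computed from a binary confusion matrix. The function name `binary_metrics` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. bioluminescent) samples predicted correctly / incorrectly.
    tn/fp: negative samples predicted correctly / incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # fraction of all samples predicted correctly
        "sensitivity": tp / (tp + fn),    # fraction of positives recovered
        "specificity": tn / (tn + fp),    # fraction of negatives recovered
    }


# Made-up counts for illustration only: 10 positives and 90 negatives.
print(binary_metrics(tp=9, fn=1, tn=80, fp=10))
```

Accuracy is a weighted average of sensitivity and specificity, with weights given by the share of positives and negatives in the test set, so on an imbalanced test set it is pulled toward the metric of the larger class. This is why sensitivity and specificity are usually reported alongside accuracy for imbalanced problems.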

Publisher
Database: Elsevier - ScienceDirect
Journal: Computers in Biology and Medicine - Volume 68, 1 January 2016, Pages 27–36
Authors
, ,