An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data

Article code	Journal code	Publication year	English article	Full-text version
1164973	1491026	2014	11-page PDF	Free download

Keywords

PubChem; Over-sampling; Under-sampling; Imbalanced classification; High-throughput screening

Related topics

Engineering and Basic Sciences, Chemistry, Analytical Chemistry

Abstract

• A GLMBoost coupled with SMOTE algorithm is proposed to classify imbalanced data.
• It is easy for non-statistical experts to employ for classifying imbalanced data.
• GLMBoost proves to have stronger predictive power and higher computational efficiency.

It is common that imbalanced datasets are often generated from high-throughput screening (HTS). For a given dataset without taking into account the imbalanced nature, most classification methods tend to produce high predictive accuracy for the majority class, but significantly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with Synthetic Minority Over-sampling TEchnique (SMOTE) is developed and utilized to overcome the problem for several imbalanced datasets from PubChem BioAssay. By applying the proposed combinatorial method, those data of rare samples (active compounds), for which usually poor results are generated, can be detected apparently with high balanced accuracy (Gmean). As a comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also adopted to classify the same datasets. Our results show that the former (GLMBoost + SMOTE) not only exhibits higher performance as measured by the percentage of correct classification for the rare samples (Sensitivity) and Gmean, but also demonstrates greater computational efficiency than the latter (RF + SMOTE). Therefore, we hope that the proposed combinatorial algorithm based on GLMBoost and SMOTE could be extensively used to tackle the imbalanced classification problem.
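The abstract reports results in terms of Sensitivity and Gmean. As a rough illustration of why these are preferred over plain accuracy on imbalanced data, the sketch below computes both from predicted labels. It assumes the common definition of G-mean as the square root of sensitivity times specificity; the function name and the toy label counts are hypothetical and are not taken from the paper.

```python
import math


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, G-mean) for binary labels.

    Assumes G-mean = sqrt(sensitivity * specificity), the usual definition;
    the abstract itself does not spell the formula out.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # active compound correctly flagged
            else:
                fn += 1  # active compound missed
        else:
            if pred == positive:
                fp += 1  # inactive compound wrongly flagged
            else:
                tn += 1  # inactive compound correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Hypothetical screen: 10 actives among 1000 compounds.
y_true = [1] * 10 + [0] * 990
# A classifier that labels every compound inactive.
all_inactive = [0] * 1000
accuracy = sum(t == p for t, p in zip(y_true, all_inactive)) / len(y_true)
print(accuracy, sensitivity_specificity_gmean(y_true, all_inactive))
```

In this toy case the all-inactive classifier is 99% accurate, yet its sensitivity and G-mean are both zero, which is the failure mode the abstract describes for the minority class.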

Flow chart for the proposed combinatorial algorithm with SMOTE and statistical methods for imbalanced data.
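The flow chart itself is not reproduced on this page. For readers unfamiliar with the over-sampling step it refers to, the following is a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is a simplified illustration, not the authors' implementation; the function name, the default k, and the choice to draw base samples at random are assumptions made for this example.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature vectors (the rare class).
    Simplification: base samples are drawn at random rather than iterated
    over in order as in the original SMOTE description.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours (Euclidean), excluding the base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D descriptors for four active compounds.
actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_sketch(actives, n_synthetic=3, k=2))
```

The synthetic points would be appended to the training set only, before fitting the classifier (GLMBoost or Random Forest in the abstract), so that evaluation still reflects the original class ratio.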

Publisher

Database: Elsevier - ScienceDirect
Journal: Analytica Chimica Acta - Volume 806, 2 January 2014, Pages 117–127

Authors

Ming Hao, Yanli Wang, Stephen H. Bryant
