Article code: 469075
Journal code: 698284
Publication year: 2016
English article: 16-page PDF
Full-text version: Free download
English title of the ISI article
A MapReduce approach to diminish imbalance parameters for big deoxyribonucleic acid dataset
Keywords
MapReduce; K-nearest neighbor; Big data; DNA (deoxyribonucleic acid); Computational biology; Imbalanced data
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Science (General)
English abstract


• Imbalanced data sets are considered a special case of classification problems.
• MapReduce with prototype reduction can easily handle large-scale data sets with good speedup and lower running time.
• For high quality, four reduction types were compared: data cleaning, rule aggregation, rule synthesis and rule update.
• A real DNA dataset consisting of 90 million pairs was used with the reduction types.
• The proposed MapReduce-based K-NN classifier reduced the imbalanced data set and achieved accurate results for the DNA data.

Background
In the age of the information superhighway, big data play a significant role in information processing, extraction, retrieval and management. In computational biology, the continuous challenge is to manage the biological data. Data mining techniques are sometimes inadequate for new space and time requirements. Thus, it is critical to process massive amounts of data to retrieve knowledge. The existing software and automated tools for handling big data sets are not sufficient. As a result, a scalable mining technique that harnesses the large storage and processing capability of distributed or parallel processing platforms is essential.

Method
In this analysis, a contemporary distributed clustering methodology for imbalanced data reduction using the k-nearest neighbor (K-NN) classification approach is introduced. The pivotal objective of this work is to represent real training data sets with a reduced number of elements or instances. These reduced data sets will ensure faster data classification and standard storage management with less sensitivity. However, general data reduction methods cannot manage very big data sets. To minimize these difficulties, a MapReduce-oriented framework is designed using various clusters of automated contents, comprising multiple algorithmic approaches.

Results
To test the proposed approach, a real DNA (deoxyribonucleic acid) dataset that consists of 90 million pairs was used. The proposed model reduces the imbalanced data in large-scale data sets without loss of accuracy.

Conclusions
The obtained results show that the MapReduce-based K-NN classifier provided accurate results for big DNA data.
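
The abstract does not spell out the reduction algorithm, so the following is only an illustrative sketch of the general idea of MapReduce-style prototype reduction followed by K-NN classification. It assumes Hart's condensed nearest neighbour as the per-partition reduction, Hamming distance over equal-length sequences, and a round-robin split in place of a real cluster. The function names (`condense`, `map_reduce_condense`, `knn_predict`) and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter
from functools import reduce


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(prototypes, query, k=1):
    """Majority vote among the k prototypes closest to `query`."""
    nearest = sorted(prototypes, key=lambda p: hamming(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def condense(partition):
    """Map step (assumed): Hart's condensed nearest neighbour on one split.

    An instance is kept only if the prototypes kept so far misclassify it.
    """
    if not partition:
        return []
    kept = [partition[0]]
    changed = True
    while changed:
        changed = False
        for seq, label in partition:
            if (seq, label) in kept:
                continue
            if knn_predict(kept, seq, k=1) != label:
                kept.append((seq, label))
                changed = True
    return kept


def split(data, n_parts):
    """Round-robin split standing in for a framework's input splits."""
    return [data[i::n_parts] for i in range(n_parts)]


def map_reduce_condense(data, n_parts=2):
    """Condense each split independently, then concatenate (reduce step)."""
    mapped = map(condense, split(data, n_parts))
    return reduce(lambda acc, part: acc + part, mapped, [])


# Toy, made-up training set: six majority ("neg") and two minority ("pos").
train = [
    ("AAAA", "neg"), ("AAAT", "neg"), ("AATA", "neg"), ("ATAA", "neg"),
    ("TAAA", "neg"), ("AATT", "neg"),
    ("GGGG", "pos"), ("GGGC", "pos"),
]
prototypes = map_reduce_condense(train, n_parts=2)
prediction = knn_predict(prototypes, "GGCC", k=1)
```

On this toy input the eight instances shrink to four prototypes, two per class, so the reduced set is both smaller and less imbalanced than the original. Because each split is condensed independently, the map step could run in parallel on separate nodes; the concatenating reducer is the simplest possible choice, and a real pipeline might apply a further filtering pass there.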

Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Methods and Programs in Biomedicine - Volume 131, July 2016, Pages 191–206