Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies

کد مقاله	کد نشریه	سال انتشار	مقاله انگلیسی	نسخه تمام متن
4956551	1444525	2016	29 صفحه PDF	دانلود رایگان

عنوان انگلیسی مقاله ISI

دانلود مقاله + سفارش ترجمه

دانلود مقاله ISI انگلیسی

رایگان برای ایرانیان

کلمات کلیدی

Distributed - توزیع شده Data mining - داده‌کاوی Cloud computing - رایانش ابری Parallel computing - رایانش موازی، محاسبات موازی MapReduce - نگاشت کاهش Big Data - کلان داده

موضوعات مرتبط

مهندسی و علوم پایه مهندسی کامپیوتر شبکه های کامپیوتری و ارتباطات

پیش نمایش صفحه اول مقاله

Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies

چکیده انگلیسی

Mining with big data or big data mining has become an active research area. It is very difficult using current methodologies and data mining software tools for a single personal computer to efficiently deal with very large datasets. The parallel and cloud computing platforms are considered a better solution for big data mining. The concept of parallel computing is based on dividing a large problem into smaller ones and each of them is carried out by one single processor individually. In addition, these processes are performed concurrently in a distributed and parallel manner. There are two common methodologies used to tackle the big data problem. The first one is the distributed procedure based on the data parallelism paradigm, where a given big dataset can be manually divided into n subsets, and n algorithms are respectively executed for the corresponding n subsets. The final result can be obtained from a combination of the outputs produced by the n algorithms. The second one is the MapReduce based procedure under the cloud computing platform. This procedure is composed of the map and reduce processes, in which the former performs filtering and sorting and the later performs a summary operation in order to produce the final result. In this paper, we aim to compare the performance differences between the distributed and MapReduce methodologies over large scale datasets in terms of mining accuracy and efficiency. The experiments are based on four large scale datasets, which are used for the data classification problems. The results show that the classification performances of the MapReduce based procedure are very stable no matter how many computer nodes are used, better than the baseline single machine and distributed procedures except for the class imbalance dataset. In addition, the MapReduce procedure requires the least computational cost to process these big datasets.

ناشر

Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Systems and Software - Volume 122, December 2016, Pages 83-92

نویسندگان

Chih-Fong Tsai, Wei-Chao Lin, Shih-Wen Ke,

علوم انسانی و هنر

فنی، مهندسی و علوم پایه

پزشکی و سلامت

بیو تکنولوژی

پذیرش سفارش ترجمه

Big data mining with parallel computing: A comparison of distributed and MapReduce methodologies

دسترسی سریع

ارتباط

English Website