کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
396780 670591 2016 12 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Single-pass and linear-time k-means clustering based on MapReduce
ترجمه فارسی عنوان
خوشه‌بندی k متوسط تک‌پاس و زمان خطی بر اساس "نگاشت کاهش"
کلمات کلیدی
توزیع K متوسط ؛ خوشه بندی داده؛ دسته بندی بر پایه نگاشت‌کاهش
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر هوش مصنوعی
چکیده انگلیسی


• mrk-means is a novel clustering algorithm which is based on MapReduce.
• mrk-means is single-pass and linear-time.
• mrk-means results in clusters that are O(log2k)-competitive to optimal solution.
• mrk-means is both faster and more accurate than Apache Mahout and GraphLab k-means.

In recent years, k-means has been fitted into the MapReduce framework and hence it has become a very effective solution for clustering very large datasets. However, k-means is not inherently suitable for execution in MapReduce. The iterative nature of k-means cannot be modeled in MapReduce and hence for each iteration of k-means an independent MapReduce job must be executed and this results in high I/O overhead because in each iteration the whole dataset must be read and written to slow disks. We have proposed a single-pass solution based on MapReduce called mrk-means which uses the reclustering technique. In contrast to available MapReduce-based k-means implementations, mrk-means just reads the dataset once and hence it is several times faster. The time complexity of mrk-means is linear which is lower than the iterative k-means. Due to usage of k-means++ seeding algorithm, mrk-means results in clusters with higher quality, too. Theoretically, the results of mrk-means   are O(log2k)-competitive to optimal clustering in the worst case, considering k as the number of clusters. During our experiments which were done on a cluster of 40 machines running the Hadoop framework, mrk-means showed both faster execution times, and higher quality of clustering results compared to available MapReduce-based and stream-based k-means variants.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Information Systems - Volume 60, August–September 2016, Pages 1–12
نویسندگان
, ,