Single-pass and linear-time k-means clustering based on MapReduce

Article ID	Journal	Published Year	Pages	File Type
396780	Information Systems	2016	12 Pages	PDF

Abstract

•mrk-means is a novel clustering algorithm which is based on MapReduce.•mrk-means is single-pass and linear-time.•mrk-means results in clusters that are O(log2k)-competitive to optimal solution.•mrk-means is both faster and more accurate than Apache Mahout and GraphLab k-means.

In recent years, k-means has been fitted into the MapReduce framework and hence it has become a very effective solution for clustering very large datasets. However, k-means is not inherently suitable for execution in MapReduce. The iterative nature of k-means cannot be modeled in MapReduce and hence for each iteration of k-means an independent MapReduce job must be executed and this results in high I/O overhead because in each iteration the whole dataset must be read and written to slow disks. We have proposed a single-pass solution based on MapReduce called mrk-means which uses the reclustering technique. In contrast to available MapReduce-based k-means implementations, mrk-means just reads the dataset once and hence it is several times faster. The time complexity of mrk-means is linear which is lower than the iterative k-means. Due to usage of k-means++ seeding algorithm, mrk-means results in clusters with higher quality, too. Theoretically, the results of mrk-means are O(log2k)-competitive to optimal clustering in the worst case, considering k as the number of clusters. During our experiments which were done on a cluster of 40 machines running the Hadoop framework, mrk-means showed both faster execution times, and higher quality of clustering results compared to available MapReduce-based and stream-based k-means variants.

Keywords

Data clustering