Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
396780 | Information Systems | 2016 | 12 Pages |
•mrk-means is a novel clustering algorithm which is based on MapReduce.•mrk-means is single-pass and linear-time.•mrk-means results in clusters that are O(log2k)-competitive to optimal solution.•mrk-means is both faster and more accurate than Apache Mahout and GraphLab k-means.
In recent years, k-means has been fitted into the MapReduce framework and hence it has become a very effective solution for clustering very large datasets. However, k-means is not inherently suitable for execution in MapReduce. The iterative nature of k-means cannot be modeled in MapReduce and hence for each iteration of k-means an independent MapReduce job must be executed and this results in high I/O overhead because in each iteration the whole dataset must be read and written to slow disks. We have proposed a single-pass solution based on MapReduce called mrk-means which uses the reclustering technique. In contrast to available MapReduce-based k-means implementations, mrk-means just reads the dataset once and hence it is several times faster. The time complexity of mrk-means is linear which is lower than the iterative k-means. Due to usage of k-means++ seeding algorithm, mrk-means results in clusters with higher quality, too. Theoretically, the results of mrk-means are O(log2k)-competitive to optimal clustering in the worst case, considering k as the number of clusters. During our experiments which were done on a cluster of 40 machines running the Hadoop framework, mrk-means showed both faster execution times, and higher quality of clustering results compared to available MapReduce-based and stream-based k-means variants.