Article ID: 4961103
Journal: Procedia Computer Science
Published Year: 2017
Pages: 6
File Type: PDF
Abstract

Clustering algorithms are widely used in data mining. They attempt to partition elements into several clusters such that elements in the same cluster are similar to each other, while elements belonging to different clusters are dissimilar. The recently published density peaks clustering algorithm overcomes a disadvantage of distance-based algorithms, which can only find clusters of nearly circular shape: it can discover clusters of arbitrary shape and is insensitive to noisy data. However, it needs to calculate the distances between all pairs of data points and therefore does not scale to big data. To reduce the computational cost of the algorithm, we propose an efficient distributed density peaks clustering algorithm based on Spark's GraphX. This paper demonstrates the effectiveness of the method on two different data sets. The experimental results show that our system improves performance significantly (up to 10x) compared to a MapReduce implementation. We also evaluate the expansibility and scalability of our system.
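
To make the all-pairs cost concrete, the following is a minimal single-machine sketch of density peaks clustering with a cutoff kernel. It is not the distributed GraphX implementation proposed in the paper. The function name `density_peaks` and the parameters `d_c` (cutoff distance) and `n_clusters` are illustrative assumptions, and choosing centers by the largest product of density and separation is one common heuristic rather than necessarily the authors' choice.

```python
import math


def density_peaks(points, d_c, n_clusters):
    """Single-machine sketch of density peaks clustering (cutoff kernel).

    points: list of equal-length coordinate tuples
    d_c: cutoff distance used to count neighbours (assumed parameter)
    n_clusters: number of centers to pick (assumed parameter)
    """
    n = len(points)
    # All-pairs Euclidean distances: the O(n^2) step that limits scalability.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i: number of other points closer than d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n   # distance to the nearest higher-ranked point
    parent = [-1] * n   # index of that higher-ranked point
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: by convention, delta is its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers: points with the largest rho * delta (heuristic choice).
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))
    labels = [-1] * n
    for c, i in enumerate(centers[:n_clusters]):
        labels[i] = c

    # Each remaining point takes the label of its nearest higher-ranked
    # neighbour; visiting in density order labels that neighbour first.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels
```

Calling `density_peaks(points, d_c=1.0, n_clusters=2)` on two well-separated groups of 2-D points returns one integer label per point. The nested distance table is what grows quadratically with the number of points, which is the cost a distributed version aims to spread across workers.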
