Speeding up the large-scale consensus fuzzy clustering for handling Big Data

Article ID	Journal	Published Year	Pages	File Type
8941773	Fuzzy Sets and Systems	2018	25 Pages	PDF

Abstract

Massive data can give companies a real competitive advantage: it is used to respond better to customers, to track consumer behavior, to anticipate changes, and so on. However, it has its own drawbacks. Such data volumes not only require large storage space but also make analysis, processing and retrieval operations very difficult and hugely time-consuming. One way to overcome these problems is to cluster the data into a compact form that remains an informative version of the entire data set. Many clustering algorithms have been proposed, but they scale poorly in computation time as the data size grows. In this paper, we make full use of consensus clustering to handle Big Data clustering. We use sampling combined with a split-and-merge strategy to fragment the data into small subsets; basic partitions are then generated locally from these subsets using RHadoop's parallel-processing MapReduce model, and a consensus is then sought over these partitions to obtain the final result. A scalability analysis demonstrates the performance of the proposed clustering models as both the number of computing nodes and the sample size increase, while satisfying the volume and velocity dimensions.
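The split, local clustering and consensus steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not name the base clusterer or the consensus rule, so this sketch assumes plain fuzzy c-means for the basic partitions and, as a stand-in for the consensus step, a second fuzzy c-means pass over the pooled local centers. An ordinary Python loop takes the place of the RHadoop MapReduce map phase, and all function names (`split`, `fuzzy_c_means`, `consensus_centers`, `demo_points`) are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(point, centers, m=2.0):
    """Fuzzy c-means membership degrees of one point; they sum to 1."""
    d = [sq_dist(point, c) for c in centers]
    if min(d) == 0.0:  # the point coincides with at least one center
        hits = [1.0 if x == 0.0 else 0.0 for x in d]
        return [h / sum(hits) for h in hits]
    inv = [x ** (-1.0 / (m - 1.0)) for x in d]
    total = sum(inv)
    return [v / total for v in inv]


def init_centers(points, k):
    """Deterministic maximin start: each new center is the point farthest
    from the centers chosen so far (an assumption of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def fuzzy_c_means(points, k, m=2.0, iters=50):
    """Standard fuzzy c-means with a fixed number of iterations."""
    centers = init_centers(points, k)
    dims = range(len(points[0]))
    for _ in range(iters):
        u = [memberships(p, centers, m) for p in points]
        new_centers = []
        for j in range(k):
            w = [row[j] ** m for row in u]  # fuzzified weights for cluster j
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * p[d] for wi, p in zip(w, points)) / total
                for d in dims))
        centers = new_centers
    return centers


def split(points, n_chunks, seed=0):
    """Shuffle the data and deal it into n_chunks random subsets."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def consensus_centers(points, k, n_chunks=4, seed=0):
    """Cluster each subset locally, then merge the local centers."""
    chunks = split(points, n_chunks, seed)
    # "Map" phase: on a cluster, each chunk would go to a different node.
    local = [fuzzy_c_means(chunk, k) for chunk in chunks]
    # Consensus stand-in (assumption): re-cluster the pooled local centers.
    pooled = [c for centers in local for c in centers]
    return fuzzy_c_means(pooled, k)


def demo_points(n_per_cluster=200, seed=1):
    """Two synthetic Gaussian blobs centered at (0, 0) and (10, 10)."""
    rng = random.Random(seed)
    pts = []
    for cx, cy in [(0.0, 0.0), (10.0, 10.0)]:
        for _ in range(n_per_cluster):
            pts.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
    return pts


if __name__ == "__main__":
    for center in sorted(consensus_centers(demo_points(), k=2)):
        print(center)
```

Because each subset is clustered independently, the local step parallelizes naturally, and the consensus pass only sees `k * n_chunks` points regardless of the original data size. That is the property a scalability analysis over the number of nodes and the sample size would exercise.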

Keywords

Big data analytics; Fuzzy clustering; Sampling; MapReduce