An efficient data reduction method and its application to cluster analysis

Article ID	Journal	Published Year	Pages	File Type
4947622	Neurocomputing	2017	11 Pages	PDF

Abstract

Data reduction plays an important role in data mining, but existing methods cannot efficiently identify all the major features hidden in large datasets, and in some cases they even lose key features of the original data. In this paper, a new efficient measure is developed to reduce a given dataset and uncover its major features; the measure multiplies the defined absolute density of each data point by its defined local density. Both densities are estimated with a fast grid-based bisecting method. To test its performance on feature reduction and sample reduction, a group of synthetic datasets with different features and 24 benchmark datasets were used as examples, with clustering accuracy, runtime, and separability among clusters as the measurements. The results show that the proposed method can quickly reduce a dataset and identify its most important features. Additionally, it can effectively determine the optimal number of clusters by suppressing noisy data and enhancing the separation among clusters.
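
The abstract does not give the definitions of the two densities, so the following is only a minimal sketch of the general idea of scoring samples by a product of two grid-based densities, not the paper's algorithm. It assumes that the "absolute density" of a point is the number of samples in its grid cell, that the "local density" is the mean count over the surrounding 3^d block of cells, and that bisecting each axis `depth` times yields the grid. The names `grid_density_scores`, `reduce_samples`, `depth`, and `keep` are hypothetical.

```python
from collections import Counter
from itertools import product

import numpy as np


def grid_density_scores(X, depth=3):
    """Score each sample by (cell count) * (mean count in the 3^d neighbourhood).

    Both factors are assumed stand-ins for the paper's absolute and
    local densities, which the abstract does not define.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    bins = 2 ** depth  # bisecting an axis `depth` times gives 2**depth intervals
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat axes
    idx = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    keys = [tuple(row) for row in idx.tolist()]
    counts = Counter(keys)  # samples per occupied cell

    offsets = list(product((-1, 0, 1), repeat=d))  # own cell plus its neighbours
    cache = {}
    scores = np.empty(n)
    for i, key in enumerate(keys):
        if key not in cache:
            absolute = counts[key]
            neighbourhood = sum(
                counts.get(tuple(k + o for k, o in zip(key, off)), 0)
                for off in offsets
            )
            local = neighbourhood / len(offsets)
            cache[key] = absolute * local
        scores[i] = cache[key]
    return scores


def reduce_samples(X, keep=0.5, depth=3):
    """Return the sorted indices of the `keep` fraction of samples with the highest scores."""
    scores = grid_density_scores(X, depth)
    k = max(1, int(round(keep * len(scores))))
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)
```

Under these assumptions, isolated samples receive a low score on both factors, so they are dropped first when the dataset is reduced. The neighbourhood enumeration grows as 3^d, so this sketch is only practical for low-dimensional data.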

Keywords

Sample reduction; Dimensionality reduction; Data reduction