Iterative random projections for high-dimensional data clustering

Article ID	Journal	Published Year	Pages	File Type
535786	Pattern Recognition Letters	2012	7 Pages	PDF

Abstract

In this text we propose a method which efficiently performs clustering of high-dimensional data. The method builds on random projection and the K-means algorithm. The idea is to apply K-means several times, increasing the dimensionality of the data after each convergence of K-means. We compare the proposed algorithm on four high-dimensional datasets, image, text and two synthetic, with K-means clustering using a single random projection and K-means clustering of the original high-dimensional data. Regarding time we observe that the algorithm reduces drastically the time when compared to K-means on the original high-dimensional data. Regarding mean squared error the proposed method reaches a better solution than clustering using a single random projection. More notably in the experiments performed it also reaches a better solution than clustering on the original high-dimensional data.

► We propose a novel iterative method for the hard-clustering problem suited for high-dimensional data. ► The method builds on K-means and random projections (RP) and relates to simulated annealing (SA) clustering. ► Experiments on an image, a text and two synthetic datasets show good performance regarding both accuracy and time. ► We achieve a lower MSE than using a single RP and than classic K-means. ► The running time can still be faster than classic K-means, unlike SA clustering which is drastically slower.

Keywords

Clustering High-dimensional data k-Means Random projections