Large-scale k-means clustering via variance reduction

Article ID	Journal	Published Year	Pages	File Type
6863752	Neurocomputing	2018	33 Pages	PDF

Abstract

With the increase of the volume of data such as images in web, it is challenging to perform k-means clustering on millions or even billions of images efficiently. One of the reasons is that k-means requires to use a batch of training data to update cluster centers at every iteration, which is time-consuming. Conventionally, k-means is accelerated by using one or a mini-batch of instances to update the centers, which leads to a bad performance due to the stochastic noise. In the paper, we decrease such stochastic noise, and accelerate k-means by using variance reduction technique. Specifically, we propose a position correction mechanism to correct the drift of the cluster centers, and propose a variance reduced k-means named VRKM. Furthermore, we optimize VRKM by reducing its computational cost, and propose a new variant of the variance reduced k-means named VRKM++. Comparing with VRKM, VRKM++ does not have to compute the batch gradient, and is more efficient. Extensive empirical studies show that our methods VRKM and VRKM++ outperform the state-of-the-art method, and obtain about 2â¯Ãâ¯ and 4â¯Ãâ¯ speedups for large-scale clustering, respectively. The source code is available at https://www.github.com/YaweiZhao/VRKM_sofia-ml.

Keywords

K-means clustering Variance reduction