Article ID Journal Published Year Pages File Type
6863752 Neurocomputing 2018 33 Pages PDF
Abstract
With the increase of the volume of data such as images in web, it is challenging to perform k-means clustering on millions or even billions of images efficiently. One of the reasons is that k-means requires to use a batch of training data to update cluster centers at every iteration, which is time-consuming. Conventionally, k-means is accelerated by using one or a mini-batch of instances to update the centers, which leads to a bad performance due to the stochastic noise. In the paper, we decrease such stochastic noise, and accelerate k-means by using variance reduction technique. Specifically, we propose a position correction mechanism to correct the drift of the cluster centers, and propose a variance reduced k-means named VRKM. Furthermore, we optimize VRKM by reducing its computational cost, and propose a new variant of the variance reduced k-means named VRKM++. Comparing with VRKM, VRKM++ does not have to compute the batch gradient, and is more efficient. Extensive empirical studies show that our methods VRKM and VRKM++ outperform the state-of-the-art method, and obtain about 2 ×  and 4 ×  speedups for large-scale clustering, respectively. The source code is available at https://www.github.com/YaweiZhao/VRKM_sofia-ml.
Related Topics
Physical Sciences and Engineering Computer Science Artificial Intelligence
Authors
, , , , , ,