Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
10361295 | Pattern Recognition | 2015 | 40 Pages |
Abstract
Data clustering is an unsupervised learning task that has found many applications in various scientific fields. The goal is to find subgroups of closely related data samples (clusters) in a set of unlabeled data. Kernel k-Means is a state of the art clustering algorithm. However, in contrast to clustering algorithms that can work using only a limited percentage of the data at a time, Kernel k-Means is a global clustering algorithm. It requires the computation of the kernel matrix, which takes O(n2d) time and O(n2) space in memory. As datasets grow larger, the application of Kernel k-Means becomes infeasible on a single computer, a fact that strongly suggests a distributed approach. In this paper, we present such an approach to the Kernel k-Means clustering algorithm, in order to make its application to a large number of samples feasible and, thus, achieve high performance clustering results on very big datasets. Our distributed approach follows the MapReduce programming model and consists of 3 stages, the kernel matrix computation, a novel kernel matrix trimming method and the Kernel k-Means clustering algorithm.
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Vision and Pattern Recognition
Authors
Nikolaos Tsapanos, Anastasios Tefas, Nikolaos Nikolaidis, Ioannis Pitas,