Article ID: 6872902
Journal: Future Generation Computer Systems
Published Year: 2018
Pages: 49
File Type: PDF
Abstract
Modern data is characterized by its ever-increasing volume and complexity, particularly when data instances belong to many categories simultaneously. This learning paradigm is known as multi-label classification, and one of its most renowned methods is the multi-label k nearest neighbor (Ml-knn). Traditional implementations of this method are not feasible for large-scale multi-label data due to their computational complexity and memory requirements. We propose a distributed Ml-knn implementation based on the MapReduce programming model, implemented on Apache Spark. We compare three strategies for distributed nearest neighbor search: 1) iteratively broadcasting instances, 2) using a distributed tree-based index structure, and 3) building hash tables to group instances. The experimental study evaluates the trade-off between prediction quality and runtime on 22 benchmark datasets, and compares scalability across different data sizes. The results indicate that the tree-based index strategy outperforms the other approaches, achieving a speedup of up to 266x on the largest dataset while maintaining accuracy equivalent to that of the exact methods. This strategy enables Ml-knn to scale efficiently with the size of the problem.
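To make the first strategy concrete, the following is a minimal PySpark sketch (not the authors' code) of distributed nearest neighbor search by broadcasting instances: a batch of query instances is broadcast to every partition of the training RDD, each worker computes local distances, and a reduce step keeps the k smallest per query. The dataset sizes, dimensionality, and k are illustrative assumptions.

```python
# Sketch of strategy 1 (broadcasting instances), assuming random toy data.
import heapq
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-knn-sketch").getOrCreate()
sc = spark.sparkContext

k = 3
# Training set as an RDD of (id, feature vector) pairs; values are illustrative.
train = sc.parallelize([(i, np.random.rand(10)) for i in range(1000)])
# Small batch of query instances, shipped to all workers via a broadcast variable.
queries = [(q, np.random.rand(10)) for q in range(5)]
bq = sc.broadcast(queries)

def local_distances(partition):
    # Each worker emits the distance from every local training instance
    # to every broadcast query instance, keyed by query id.
    for tid, x in partition:
        for qid, q in bq.value:
            yield (qid, (float(np.linalg.norm(x - q)), tid))

def merge_topk(a, b):
    # Keep only the k smallest distances when combining partial results.
    return heapq.nsmallest(k, a + b)

top_k = (train.mapPartitions(local_distances)
              .aggregateByKey([], lambda acc, d: merge_topk(acc, [d]), merge_topk)
              .collect())

for qid, neighbours in top_k:
    print(qid, neighbours)  # list of (distance, training id) pairs

spark.stop()
```

In the full Ml-knn method the neighbour sets found this way would feed the per-label posterior estimates; the tree-based and hash-table strategies replace the brute-force distance step above with an index lookup.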
Related Topics
Physical Sciences and Engineering › Computer Science › Computational Theory and Mathematics