CHI-PG: A fast prototype generation algorithm for Big Data classification problems

Article ID	Journal	Published Year	Pages	File Type
6864356	Neurocomputing	2018	29 Pages	PDF

Abstract

The growing amount of available data has become a serious challenge to data mining and machine learning techniques. Well-known classification methods that have been widely applied so far are no longer feasible in Big Data environments. For this reason, prototype reduction techniques (both selection and generation) have emerged as a candidate solution: they build a reduced version of the dataset that speeds up the execution of algorithms such as k-Nearest Neighbors and overcomes their memory constraints. However, these solutions generally have quadratic O(N²) time complexity and therefore share the time and memory limitations of the data mining and machine learning algorithms they are meant to support. To overcome these limitations, we introduce a new distributed MapReduce prototype generation method called CHI-PG that provides linear O(N) time complexity and ensures constant accuracy regardless of the degree of parallelism. This approach builds prototypes by applying a simple scheme based on the rule generation process of the Chi et al. Fuzzy Rule-Based Classification System and takes advantage of the suitability of this classifier for the MapReduce paradigm. The empirical study shows that our new approach significantly improves on the execution time of a state-of-the-art distributed prototype reduction algorithm (MRPR) without decreasing (and in some cases even improving) classification accuracy and reduction rates. Moreover, CHI-PG is shown to be a candidate solution to the time and memory constraints of k-Nearest Neighbors when tackling large-scale datasets.
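The abstract does not spell out how prototypes are built, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that each example is assigned, per attribute, to the uniformly spaced triangular fuzzy set with the highest membership (as in Chi et al. rule generation), and that one prototype is formed by averaging the examples that share the same fuzzy antecedent and class. The averaging step, the function names (`fuzzy_label`, `map_partition`, `merge`, `generate_prototypes`) and the parameter `n_labels` are all assumptions made for illustration. Each example is visited once, and the merge step is order-independent, which mirrors the linear cost and the independence from the number of partitions described above.

```python
from functools import reduce


def fuzzy_label(x, lo, hi, n_labels=3):
    """Index of the triangular fuzzy set with the highest membership for x.

    With uniformly spaced triangular sets over [lo, hi], the winning set
    is the one whose center is nearest to x.
    """
    if hi == lo:
        return 0
    pos = (x - lo) / (hi - lo) * (n_labels - 1)
    return min(n_labels - 1, max(0, int(round(pos))))


def map_partition(rows, bounds, n_labels=3):
    """Map phase (hypothetical): one pass over a data partition.

    Accumulates, per (antecedent, class) key, the feature sums and count.
    """
    acc = {}
    for features, label in rows:
        antecedent = tuple(
            fuzzy_label(x, lo, hi, n_labels)
            for x, (lo, hi) in zip(features, bounds)
        )
        key = (antecedent, label)
        sums, count = acc.get(key, ([0.0] * len(features), 0))
        acc[key] = ([s + x for s, x in zip(sums, features)], count + 1)
    return acc


def merge(a, b):
    """Reduce phase (hypothetical): combine two partial accumulators."""
    out = dict(a)
    for key, (sums, count) in b.items():
        if key in out:
            prev_sums, prev_count = out[key]
            out[key] = ([p + q for p, q in zip(prev_sums, sums)],
                        prev_count + count)
        else:
            out[key] = (sums, count)
    return out


def generate_prototypes(partitions, bounds, n_labels=3):
    """Average each group into one prototype: (feature tuple, class)."""
    merged = reduce(
        merge,
        (map_partition(p, bounds, n_labels) for p in partitions),
        {},
    )
    return sorted(
        (tuple(s / count for s in sums), label)
        for (_, label), (sums, count) in merged.items()
    )


# Toy data: two attributes in [0, 1], two classes.
data = [
    ([0.0, 0.0], "a"),
    ([0.125, 0.125], "a"),
    ([0.5, 0.5], "a"),
    ([0.875, 0.875], "b"),
    ([1.0, 1.0], "b"),
]
bounds = [(0.0, 1.0), (0.0, 1.0)]

one_node = generate_prototypes([data], bounds)
two_nodes = generate_prototypes([data[:2], data[2:]], bounds)
```

In this toy run the five examples collapse into three prototypes, and splitting the data across two partitions yields the same prototypes as processing it in one. The sketch omits the conflict resolution that the Chi et al. classifier applies when the same antecedent appears with different classes.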

Keywords

Prototype reduction; Prototype generation; Fuzzy Rule-Based Classification Systems; MapReduce; Big Data