Learning to detect representative data for large scale instance selection

Article ID	Journal	Published Year	Pages	File Type
461022	Journal of Systems and Software	2015	8 Pages	PDF

Abstract

• The ReDD (Representative Data Detection) approach is introduced for large scale instance selection.
• In ReDD, a detector learns the patterns of (un)representative data after performing instance selection.
• Then, the detector is used to detect the newly added data.
• We found that ReDD can reduce the computational cost and maintain the final classification accuracy.

Instance selection is an important data pre-processing step in the knowledge discovery process. However, the datasets of many domain problems are very large, and some are non-stationary, composed of both old data and a large amount of new data samples. Current algorithms for this type of scalability problem are limited in that they incur a very high computational cost when performing instance selection over very large scale datasets. To this end, we introduce the ReDD (Representative Data Detection) approach, which is based on outlier pattern analysis and prediction. First, a machine learning model, or detector, is trained to learn the patterns of (un)representative data selected by a specific instance selection method from a small amount of training data. Then, the detector is applied to the remaining bulk of the training data, or to newly added data, to detect which samples are representative. We empirically evaluate ReDD over 50 domain datasets to examine the effectiveness of the learned detector, using four very large scale datasets for validation. The experimental results show that ReDD not only reduces the computational cost by a factor of nearly two to three relative to three baselines, but also maintains the final classification accuracy.
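The two-step workflow described in the abstract can be illustrated with a minimal sketch. The abstract does not name the instance selection method, the detector model, or the detector's input features, so everything concrete here is an assumption made for illustration: Edited Nearest Neighbour (ENN) stands in for the "specific instance selection method", a k-nearest-neighbour vote stands in for the detector, the detector sees each sample's features with its class label appended, and the data are synthetic. The function names (`enn_select`, `train_detector`, `detect`) are hypothetical.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_label(query, points, labels, k, skip=None):
    """Majority label among the k points nearest to query (index `skip` excluded)."""
    order = sorted((i for i in range(len(points)) if i != skip),
                   key=lambda i: sq_dist(query, points[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]


def enn_select(points, labels, k=3):
    """ENN (assumed selection method): a sample is kept (True) when the
    majority label of its k nearest neighbours matches its own label."""
    return [knn_label(points[i], points, labels, k, skip=i) == labels[i]
            for i in range(len(points))]


def train_detector(points, labels, keep_flags):
    """Step 1: store (features + class label) -> keep/discard pairs.
    Appending the class label is an assumption of this sketch."""
    return [list(p) + [y] for p, y in zip(points, labels)], list(keep_flags)


def detect(detector, points, labels, k=3):
    """Step 2: predict keep/discard for unseen samples with a k-NN vote,
    instead of running the selection method on them."""
    det_x, det_y = detector
    return [knn_label(list(p) + [y], det_x, det_y, k)
            for p, y in zip(points, labels)]


# Synthetic two-class data with roughly 10% label noise (illustrative only).
rng = random.Random(0)


def make_sample():
    y = rng.randint(0, 1)
    point = (rng.gauss(3.0 * y, 1.0), rng.gauss(3.0 * y, 1.0))
    if rng.random() < 0.1:
        y = 1 - y  # flip the label to create an unrepresentative sample
    return point, y


data = [make_sample() for _ in range(300)]
X = [p for p, _ in data]
y = [label for _, label in data]

# Instance selection runs only on a small subset of the training data.
X_small, y_small = X[:60], y[:60]
X_rest, y_rest = X[60:], y[60:]

small_flags = enn_select(X_small, y_small)
detector = train_detector(X_small, y_small, small_flags)
rest_flags = detect(detector, X_rest, y_rest)

# Reduced training set: samples kept by ENN plus samples kept by the detector.
reduced = [(p, label)
           for p, label, keep in zip(X, y, small_flags + rest_flags) if keep]
print(f"kept {len(reduced)} of {len(X)} samples")
```

In this sketch the neighbour search inside ENN touches only the 60-sample subset, while the remaining 240 samples are each compared against that subset once. This is one way to picture where a cost saving could come from; it says nothing about the savings or accuracy reported in the paper.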

Keywords

Instance selection; Data mining; Data reduction