Article ID: 392448
Journal: Information Sciences
Published Year: 2013
Pages: 25
File Type: PDF
Abstract

An enormous amount of information is continually being produced in current research, which poses a challenge for data mining algorithms. Many problems in highly active research areas, such as bioinformatics, security and intrusion detection, and text mining, involve large or very large datasets, and these datasets pose serious problems for many data mining algorithms.

One way to handle very large datasets is data reduction. Among the most useful data reduction methods is simultaneous instance and feature selection, which achieves a considerable reduction in the training data while maintaining, or even improving, the performance of the data mining algorithm. However, it suffers from severe scalability problems, even for medium-sized datasets. In this paper, we propose a new evolutionary simultaneous instance and feature selection algorithm that is scalable to millions of instances and thousands of features.

This proposal is based on the divide-and-conquer principle combined with bookkeeping. The divide-and-conquer principle allows the algorithm to run in linear time. Furthermore, the proposed method is easy to implement in a parallel environment and can work without loading the entire dataset into memory.

Using 50 medium-sized datasets, we demonstrate our method’s ability to match the results of state-of-the-art instance and feature selection methods while significantly reducing the time requirements. Using 13 very large datasets, we demonstrate the scalability of our proposal to millions of instances and thousands of features.

► The paper proposes a scalable method for simultaneous instance and feature selection.
► The method is scalable to almost any size.
► The comparison shows it is faster and performs better than current methods.
► The largest dataset to which the method is applied has 50,000,000 instances and 800 features.
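
The abstract names the divide-and-conquer principle and bookkeeping but gives no algorithmic detail, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that "bookkeeping" can be approximated by tallying, over several random partitions, how often each instance and feature is kept, and it replaces the evolutionary selector with a hypothetical nearest-neighbour rule (`select_in_subset`). All function names, parameters, and the toy data are illustrative.

```python
import random
from collections import Counter


def select_in_subset(rows, labels, idx, n_features):
    """Hypothetical stand-in for the evolutionary selector (NOT the paper's method).

    Keeps an instance if its nearest neighbour inside the subset shares its label,
    and keeps a feature if its values vary inside the subset.
    """
    kept_instances = []
    for i in idx:
        others = [j for j in idx if j != i]
        if not others:
            kept_instances.append(i)
            continue
        nearest = min(
            others,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])),
        )
        if labels[nearest] == labels[i]:
            kept_instances.append(i)
    kept_features = [
        f for f in range(n_features) if len({rows[i][f] for i in idx}) > 1
    ]
    return kept_instances, kept_features


def divide_and_conquer_selection(rows, labels, subset_size=4, rounds=3, seed=0):
    """Run the selector on small disjoint subsets and tally the outcomes.

    Each subset has a fixed size, so the cost of one round grows linearly
    with the number of instances. The vote tally is an assumed form of
    bookkeeping, used here only for illustration.
    """
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    instance_votes, feature_votes = Counter(), Counter()
    feature_trials = 0
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)  # new random partition in every round
        for start in range(0, n, subset_size):
            idx = order[start:start + subset_size]
            kept_i, kept_f = select_in_subset(rows, labels, idx, n_features)
            instance_votes.update(kept_i)
            feature_votes.update(kept_f)
            feature_trials += 1
    # Keep whatever was retained in more than half of its trials.
    instances = sorted(i for i in range(n) if instance_votes[i] * 2 > rounds)
    features = sorted(
        f for f in range(n_features) if feature_votes[f] * 2 > feature_trials
    )
    return instances, features


# Toy data: feature 0 separates the classes, feature 1 is constant.
rows = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [0.3, 1.0],
        [10.0, 1.0], [10.1, 1.0], [10.2, 1.0], [10.3, 1.0]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

instances, features = divide_and_conquer_selection(rows, labels)
print("kept instances:", instances)
print("kept features:", features)
```

Because each subset is processed independently, the inner loop over subsets could be distributed across workers, and only one subset needs to be in memory at a time, which is consistent with the parallelism and memory claims in the abstract.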
