Article ID: 4946323
Journal: Knowledge-Based Systems
Published Year: 2017
Pages: 19 Pages
File Type: PDF
Abstract
In the era of Big Data, many datasets share a common characteristic: a large number of features. As a result, selecting the relevant features and discarding the irrelevant and redundant ones has become indispensable. However, when dealing with large amounts of data, most existing feature selection algorithms do not scale well, and their efficiency may deteriorate to the point of becoming inapplicable. Moreover, data is often distributed across multiple locations, and it is not economical or legal to gather it at a single site. For these reasons, we propose a distributed approach for partitioned data using two techniques: horizontal (i.e., by samples) and vertical (i.e., by features). Unlike existing procedures for combining the partial outputs obtained from each partition of the data, we propose a merging process based on the theoretical complexity of these feature subsets. The novel procedure, tested on 11 datasets, has proved useful, showing competitive results in terms of both runtime and classification accuracy.
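The two partitioning schemes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `horizontal_partition` and `vertical_partition`, the random assignment of samples and features to partitions, and the toy dataset are all assumptions made for illustration. The per-partition feature selection and the complexity-based merging step are not shown, since the abstract does not specify them.

```python
import random


def horizontal_partition(rows, n_parts, seed=0):
    """Split by samples: each partition gets a disjoint subset of rows
    and keeps every feature (hypothetical helper, for illustration)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled row indices round-robin into n_parts groups.
    return [[rows[i] for i in indices[p::n_parts]] for p in range(n_parts)]


def vertical_partition(rows, n_parts, seed=0):
    """Split by features: each partition gets a disjoint subset of columns
    and keeps every sample (hypothetical helper, for illustration).

    Returns a list of (feature_indices, sub_rows) pairs so that features
    selected in a partition can be mapped back to the original columns.
    """
    columns = list(range(len(rows[0])))
    random.Random(seed).shuffle(columns)
    parts = []
    for p in range(n_parts):
        cols = sorted(columns[p::n_parts])
        parts.append((cols, [[row[c] for c in cols] for row in rows]))
    return parts


# Toy dataset (assumed): 6 samples, 4 features.
data = [[10 * i + j for j in range(4)] for i in range(6)]

h_parts = horizontal_partition(data, n_parts=3)  # 3 groups of 2 samples each
v_parts = vertical_partition(data, n_parts=2)    # 2 groups of 2 features each
```

In a distributed setting, a feature selection algorithm would then run independently on each partition, and the resulting partial feature subsets would be merged into a final subset.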
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence