Article ID: 391733
Journal: Information Sciences
Published Year: 2016
Pages: 19 Pages
File Type: PDF
Abstract

• The use of ordering-based pruning approaches for ensemble learning in imbalanced classification is proposed.
• Standard pruning schemes have been adapted to the framework of imbalanced data.
• The BB-Imb and RE-GM metrics allow a significant gain in the studied models, enabling baseline methodologies to be outperformed.
• The Boosting-Based Imbalanced approach in conjunction with UnderBagging excels as the best option.
• Conclusions are supported by a thorough experimental study with 66 datasets.

Classification with imbalanced datasets has gained notable significance in recent years, since a large number of real-world problems exhibit highly skewed class distributions that degrade the global performance of the system. A great number of approaches have been developed to address this problem, traditionally from three different perspectives: data treatment, adaptation of algorithms, and cost-sensitive learning.

Ensemble-based models for classifiers are an extension over the former solutions. They consider a pool of classifiers, and they can in turn integrate any of these proposals. The quality and performance of this type of methodology over baseline solutions have been shown in several studies in the specialized literature.

The goal of this work is to improve the capabilities of tree-based ensemble solutions that were specifically designed for imbalanced classification, focusing on the best-behaving bagging- and boosting-based ensembles in this scenario. To this end, this paper proposes several new metrics for ordering-based pruning, properly adapted to the skewed class distribution. The experimental study shows two main results: on the one hand, the use of the new metrics allows pruning to become a very successful approach in this scenario; on the other hand, the behavior of the UnderBagging model excels, achieving the highest gain from pruning, since the random undersampled sets that best complement each other can be selected. Accordingly, this scheme is capable of outperforming previous ensemble models from the state-of-the-art.
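The two ideas combined in the abstract, UnderBagging (each base tree trained on the full minority class plus a random undersample of the majority class) and ordering-based pruning (greedily reordering the pool so that the sub-ensemble maximizing a validation metric is kept), can be sketched as follows. This is a minimal illustration, not the paper's exact method: the paper's BB-Imb and RE-GM orderings are not reproduced here; instead a generic geometric-mean criterion is used as the pruning metric, and all function names (`train_underbagging`, `ordering_based_pruning`, etc.) are this sketch's own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls: a skew-insensitive metric."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))


def train_underbagging(X, y, n_estimators=15, seed=0):
    """UnderBagging: each tree sees the whole minority class plus an
    equally sized random sample (without replacement) of the majority."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    trees = []
    for i in range(n_estimators):
        sampled_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sampled_maj])
        trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))
    return trees


def majority_vote(trees, X):
    """Combine predictions by simple majority (binary 0/1 labels assumed)."""
    preds = np.stack([t.predict(X) for t in trees])
    return (preds.mean(axis=0) >= 0.5).astype(int)


def ordering_based_pruning(trees, X_val, y_val, k):
    """Greedy ordering: at each step, add the classifier whose inclusion
    maximizes the sub-ensemble's G-mean on held-out validation data."""
    selected, remaining = [], list(range(len(trees)))
    while len(selected) < k:
        best_j, best_score = None, -1.0
        for j in remaining:
            candidate = [trees[i] for i in selected + [j]]
            score = g_mean(y_val, majority_vote(candidate, X_val))
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
        remaining.remove(best_j)
    return selected


# Synthetic imbalanced problem (roughly 9:1 class ratio).
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=1)
trees = train_underbagging(X_tr, y_tr, n_estimators=15)
chosen = ordering_based_pruning(trees, X_val, y_val, k=7)
score = g_mean(y_val, majority_vote([trees[i] for i in chosen], X_val))
```

Note that the greedy search makes pruning complementary to UnderBagging specifically because each tree was trained on a different random undersample: the ordering can retain the subsets that best complement one another, which is the intuition the abstract highlights.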

Related Topics
Physical Sciences and Engineering Computer Science Artificial Intelligence