Assessing the predictive accuracy of diversity measures with domain-dependent, asymmetric misclassification costs

Article ID	Journal	Published Year	Pages	File Type
10359968	Information Fusion	2005	12 Pages	PDF

Abstract

We explore the relationship between diversity measures and ensemble performance, for binary classification with simple majority voting, within a problem domain characterized by asymmetric misclassification costs. Extending the work of Kuncheva and Whitaker [Machine Learning 51(2) (2003) 181], we compare a set of diversity measures within two different data representations. The first is a direct representation, which explicitly allows for consideration of asymmetric costs by indicating the specific values of the predictions; this in turn allows for a distinction between more costly misclassifications in this domain (i.e., actual 0 predicted as 1) and less costly ones (i.e., actual 1 predicted as 0). The second is an oracle representation, which indicates predictions only as correct or incorrect, and therefore does not allow for asymmetric costs. Within these representations we identified and manipulated certain situational factors, including the percentage of target group members in the population and the designed accuracy and sensitivity of each constituent model. Based on a neural network comparison of diversity measures and ensemble performance, we found that (1) the association between a diversity measure and ensemble performance is contingent on the data representation, with Yule's Q-statistic and the coincident failure measure (CFD) as the best indicators in the direct representation and CFD alone as the best indicator in the oracle representation, and (2) this association varies as situational factors are manipulated; that is, diversity measures are differentially effective at different factor levels. Thus, the choice of a diversity measure in assessing ensemble classification performance requires an examination of both the nature of the task domain and the specific factors that comprise the domain.
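The two representations and the two diversity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy predictions, the function names, and the cost weights (`fp_cost`, `fn_cost`) are assumptions made for illustration. The Q-statistic and CFD follow the standard definitions surveyed by Kuncheva and Whitaker, and returning 0.0 when Q is undefined is a convention chosen here.

```python
from itertools import combinations

def to_oracle(preds, actual):
    """Map direct-representation labels (0/1) to oracle form (1 = correct, 0 = wrong)."""
    return [[int(p == a) for p, a in zip(row, actual)] for row in preds]

def majority_vote(preds):
    """Simple majority vote over an odd number of binary classifiers."""
    n_models = len(preds)
    return [int(2 * sum(col) > n_models) for col in zip(*preds)]

def q_statistic(u, v):
    """Yule's Q for two binary vectors (labels or oracle outputs)."""
    n11 = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)
    denom = n11 * n00 + n01 * n10
    # Q is undefined when the denominator is zero; 0.0 is an assumed convention.
    return (n11 * n00 - n01 * n10) / denom if denom else 0.0

def mean_q(rows):
    """Average pairwise Q over all classifier pairs."""
    pairs = list(combinations(rows, 2))
    return sum(q_statistic(u, v) for u, v in pairs) / len(pairs)

def cfd(oracle):
    """Coincident failure diversity computed from oracle outputs."""
    n_models, n_cases = len(oracle), len(oracle[0])
    fails = [n_models - sum(col) for col in zip(*oracle)]
    # p[i] = fraction of cases on which exactly i classifiers fail.
    p = [fails.count(i) / n_cases for i in range(n_models + 1)]
    if p[0] == 1.0:
        return 0.0
    weighted = sum((n_models - i) / (n_models - 1) * p[i]
                   for i in range(1, n_models + 1))
    return weighted / (1.0 - p[0])

def misclassification_cost(pred, actual, fp_cost=5.0, fn_cost=1.0):
    """Asymmetric cost; the weights are hypothetical, not taken from the paper."""
    cost = 0.0
    for p, a in zip(pred, actual):
        if a == 0 and p == 1:
            cost += fp_cost  # the more costly error in this domain
        elif a == 1 and p == 0:
            cost += fn_cost  # the less costly error
    return cost

# Toy data (illustrative only): three classifiers, six cases.
actual = [0, 0, 1, 1, 0, 1]
preds = [
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
]
oracle = to_oracle(preds, actual)
ensemble = majority_vote(preds)
```

Calling `mean_q(preds)` measures diversity on the direct representation and `mean_q(oracle)` on the oracle representation. Only the direct form keeps enough information for `misclassification_cost` to tell the two error types apart, which is the distinction the abstract draws between the representations.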

Keywords

Data mining; Multiple classifiers