Article ID: 6940208 · Journal: Pattern Recognition Letters · Published Year: 2018 · Pages: 9 · File Type: PDF
Abstract
Decision Trees and Random Forests are among the most popular methods for classification tasks. Two key issues faced by these methods are how to select the best attribute to associate with a node and how to split the samples given the selected attribute. This paper addresses an important challenge that arises when nominal attributes with a large number of values are present: the computational time required to compute splits of good quality. We present a framework to generate computationally efficient splitting criteria that handle, with a theoretical approximation guarantee, multi-valued nominal attributes for classification tasks with a large number of classes. Experiments with a number of datasets suggest that a method derived from our framework is competitive in terms of accuracy and speed with the Twoing criterion, one of the few criteria available that is able to handle, with an optimality guarantee, nominal attributes with a large number of distinct values. However, our method has the advantage of also efficiently handling datasets with a large number of classes. These experiments also give evidence of the potential of aggregating attributes to improve the classification power of decision trees.
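To make the computational challenge concrete, the sketch below shows the naive approach the paper's framework is designed to avoid: brute-forcing the best binary partition of a multi-valued nominal attribute under a standard impurity measure (Gini is used here for illustration; the paper's own criteria and the Twoing criterion are not reproduced). The function names and the dataset are illustrative, not from the paper. With v distinct attribute values, this loop examines 2^(v-1) - 1 partitions, which quickly becomes infeasible.

```python
from itertools import combinations
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Brute-force the best binary partition of a nominal attribute.

    Tries every way of sending a subset of the attribute's distinct
    values to the left child, scoring each split by the weighted Gini
    impurity of the two children. With v distinct values this examines
    2^(v-1) - 1 partitions -- the exponential cost that makes exact
    splitting expensive for multi-valued nominal attributes.
    """
    distinct = sorted(set(values))
    n = len(labels)
    best_score, best_left = float("inf"), None
    # Pin the first value to the left side so each partition and its
    # mirror image are counted only once.
    rest = distinct[1:]
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left_vals = {distinct[0], *combo}
            left = [y for x, y in zip(values, labels) if x in left_vals]
            right = [y for x, y in zip(values, labels) if x not in left_vals]
            if not left or not right:
                continue  # skip degenerate splits with an empty child
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best_score, best_left = score, left_vals
    return best_score, best_left


if __name__ == "__main__":
    # Toy example: attribute values a/b/c, binary class labels.
    values = ["a", "a", "b", "b", "c", "c"]
    labels = [0, 0, 1, 1, 0, 0]
    score, left = best_binary_split(values, labels)
    print(score, left)  # the split {a, c} vs {b} yields two pure children
```

For binary classification an optimal split can be found in time linear in v after sorting values by class proportion (Breiman's result), but for many classes no such shortcut applies in general, which is the regime the abstract's approximation framework targets.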
Related Topics
Physical Sciences and Engineering › Computer Science › Computer Vision and Pattern Recognition