Article ID: 392388
Journal: Information Sciences
Published Year: 2013
Pages: 26 Pages
File Type: PDF
Abstract

In this article, we introduce and investigate 4DS, a new selection strategy for pool-based active training of a generative classifier, namely CMM (classifier based on a probabilistic mixture model). Such a generative classifier aims at modeling the processes underlying the “generation” of the data. 4DS considers the distance of samples (observations) to the decision boundary, the density in the regions where samples are selected, the diversity of the samples in the query set chosen for labeling, and, indirectly, the unknown class distribution of the samples by utilizing the responsibilities of the model components for these samples. The combination of the four measures in 4DS is self-optimizing in the sense that the weights of the distance, density, and class distribution measures depend on the currently estimated performance of the classifier. With 17 benchmark data sets it is shown that 4DS outperforms a random selection strategy (baseline method), a pure closest sampling approach, ITDS (information theoretic diversity sampling), DWUS (density-weighted uncertainty sampling), DUAL (dual strategy for active learning), PBAC (prototype based active learning), and 3DS (a technique we proposed earlier that does not consider responsibility information) with respect to various evaluation criteria such as ranked performance based on classification accuracy, number of labeled samples (data utilization), and learning speed assessed by the area under the learning curve. It is also shown that, due to the use of responsibility information, 4DS solves a key problem of active learning: the class distribution of the samples chosen for labeling actually approximates the unknown “true” class distribution of the overall data set quite well.
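The selection mechanism described above can be sketched as a weighted combination of the four measures, with the weights adapted to the classifier's current performance. The following is a minimal, illustrative sketch only; the function names, the linear combination, and the weighting scheme are assumptions for illustration and do not reproduce the paper's exact formulation:

```python
import numpy as np

def select_query(distance, density, diversity, class_match, weights):
    """Rank unlabeled samples by a weighted combination of four
    4DS-style measures. All measure arrays are assumed to be
    normalized to [0, 1], with higher values meaning 'more useful'.
    This linear combination is an illustrative assumption."""
    w_dist, w_dens, w_div, w_resp = weights
    scores = (w_dist * distance
              + w_dens * density
              + w_div * diversity
              + w_resp * class_match)
    # return the index of the sample to query for a label next
    return int(np.argmax(scores))

def adapt_weights(estimated_accuracy):
    """Hypothetical self-optimizing weighting: shift emphasis from
    density/class-distribution measures (exploration) toward the
    distance measure (exploitation) as the classifier improves."""
    w_dist = estimated_accuracy
    rest = (1.0 - estimated_accuracy) / 3.0
    return (w_dist, rest, rest, rest)

# toy example with 5 unlabeled samples
rng = np.random.default_rng(0)
measures = [rng.random(5) for _ in range(4)]
weights = adapt_weights(estimated_accuracy=0.7)
idx = select_query(*measures, weights=weights)
```

The key design point the abstract highlights is the fourth measure: responsibility information from the generative mixture model stands in for the unknown class distribution, so the queried labels track the true class proportions.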
With this article, we also pave the way for advanced selection strategies for an active training of discriminative classifiers such as support vector machines or decision trees: We show that responsibility information derived from generative models can successfully be employed to improve the training of those classifiers.

► 4DS uses responsibility information to consider the unknown class distribution of the samples. ► 4DS weights distance, density, and class distribution information adaptively (self-optimizing). ► A new evaluation criterion for active learning, the class distribution match, is defined. ► With 17 benchmark data sets it is shown that 4DS outperforms seven well-known selection strategies.

Related Topics
Physical Sciences and Engineering Computer Science Artificial Intelligence