Modified criterion to select useful unlabeled data for improving semi-supervised support vector machines

Article ID	Journal	Published Year	Pages	File Type
533738	Pattern Recognition Letters	2015	9 Pages	PDF

Abstract

• A small amount of unlabeled data was selected to enhance the classification accuracy of S3VMs.
• To select them efficiently, the impacts of the labeled data and the unlabeled data were balanced.
• The class-conditional probabilities of unlabeled samples were utilized as uncertainty levels.
• Run-time characteristics and error rates of the modified criterion were empirically evaluated.

Recent studies have demonstrated that semi-supervised learning (SSL) approaches, which use both labeled and unlabeled data, are more effective and robust than approaches that use only labeled data. SemiBoost, a boosting framework for SSL, uses a similarity-based criterion to select (and utilize) a small amount of useful unlabeled data. However, this criterion sometimes fails, particularly when the unlabeled data lie near the boundary between the classes. To address this concern, this paper modifies the selection criterion to use the class-conditional probability in addition to the similarity. First, the criterion is decomposed into three terms: a positive-class term, a negative-class term, and an unlabeled term. Second, when the confidences of the unlabeled data are computed, the impacts of the three terms on the confidences are adjusted using the estimated class-conditional probability. Third, the unlabeled data with the highest confidences are selected and, together with the labeled data, used to re-train a supervised classifier. This select-and-train process is repeated until a termination condition is met. Experimental results, obtained using semi-supervised support vector machines (S3VMs) on benchmark data, demonstrate that the proposed algorithm can compensate for the shortcomings of traditional S3VMs and, compared with previous approaches, can further improve classification accuracy.
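
The select-and-train loop described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formulation: the abstract does not give the exact criterion, so the weighting inside `confidence`, the constant `C`, the Gaussian similarity, and the centroid-based probability estimate `centroid_proba` (a stand-in for the probabilistic output of a trained SVM) are all assumptions chosen only to show the overall shape of the procedure.

```python
import math

def rbf_similarity(a, b, sigma=1.0):
    """Gaussian similarity between two points (assumed similarity measure)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def centroid_proba(X_lab, y_lab, x):
    """Hypothetical stand-in for P(y=+1 | x): compares similarity to the
    two class centroids. A real system would use the trained classifier."""
    def centroid(label):
        pts = [p for p, y in zip(X_lab, y_lab) if y == label]
        return [sum(c) / len(pts) for c in zip(*pts)]
    s_pos = rbf_similarity(x, centroid(+1))
    s_neg = rbf_similarity(x, centroid(-1))
    return s_pos / (s_pos + s_neg)

def confidence(i, X_lab, y_lab, X_unl, proba, C=0.5):
    """Signed score for unlabeled point i, built from three terms.
    The weighting by proba[i] is an illustrative assumption."""
    x = X_unl[i]
    # Positive-class term: similarity to labeled positives.
    pos = sum(rbf_similarity(x, p) for p, y in zip(X_lab, y_lab) if y == +1)
    # Negative-class term: similarity to labeled negatives.
    neg = sum(rbf_similarity(x, p) for p, y in zip(X_lab, y_lab) if y == -1)
    # Unlabeled term: similarity to other unlabeled points, signed by
    # their estimated class membership (2P - 1 lies in [-1, 1]).
    unl = sum(rbf_similarity(x, u) * (2.0 * proba[j] - 1.0)
              for j, u in enumerate(X_unl) if j != i)
    return proba[i] * pos - (1.0 - proba[i]) * neg + C * unl

def select_and_train(X_lab, y_lab, X_unl, n_select=1, max_rounds=3):
    """Repeatedly move the most confident unlabeled points into the
    labeled set. Termination here is a fixed round budget (assumption)."""
    X_lab, y_lab, X_unl = list(X_lab), list(y_lab), list(X_unl)
    for _ in range(max_rounds):
        if not X_unl:
            break
        # "Re-training" is implicit: centroids are recomputed from X_lab.
        proba = [centroid_proba(X_lab, y_lab, x) for x in X_unl]
        scores = [confidence(i, X_lab, y_lab, X_unl, proba)
                  for i in range(len(X_unl))]
        ranked = sorted(range(len(X_unl)),
                        key=lambda i: abs(scores[i]), reverse=True)
        chosen = set(ranked[:n_select])
        for i in sorted(chosen):
            X_lab.append(X_unl[i])
            y_lab.append(1 if scores[i] >= 0 else -1)  # pseudo-label
        X_unl = [x for i, x in enumerate(X_unl) if i not in chosen]
    return X_lab, y_lab, X_unl
```

In this sketch, a point midway between the classes receives a probability near 0.5, so its positive and negative terms largely cancel and it is selected last. This mirrors the motivation stated in the abstract, that points near the boundary are unreliable candidates, though the actual criterion in the paper may differ.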

Keywords

Support vector machines; Semi-supervised learning