Article ID: 531556
Journal: Pattern Recognition
Published Year: 2008
Pages: 11
File Type: PDF
Abstract

The abundance of unlabelled data alongside limited labelled data has provoked significant interest in semi-supervised learning methods. “Naïve labelling” refers to the following simple strategy for using unlabelled data in on-line classification. A new data point is first labelled by the current classifier and then added to the training set together with the assigned label. The classifier is updated before seeing the subsequent data point. Although the danger of a run-away classifier is obvious, versions of naïve labelling are pervasive in on-line adaptive learning. We study the asymptotic behaviour of naïve labelling in the case of two Gaussian classes and one variable. The analysis shows that if the classifier model correctly assumes the underlying distribution of the problem, naïve labelling will drive the parameters of the classifier towards their optimal values. However, if the model is not guessed correctly, the benefits are outweighed by the instability of the labelling strategy (run-away behaviour of the classifier). The results are based on exact calculations of the point of convergence, simulations, and experiments with 25 real data sets. The findings in our study are consistent with concerns about the general use of unlabelled data raised in the recent literature.
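For illustration only (not taken from the paper), the sketch below simulates the naïve labelling strategy described in the abstract for two univariate Gaussian classes with a nearest-mean classifier. All parameter values, the initial estimates, and the running-mean update rule are assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical set-up: two univariate Gaussian classes with equal priors
# and equal, known variances. The classifier estimates only the class means.
true_means = (-1.0, 1.0)
true_std = 1.0

# Initial estimates, assumed to come from a small labelled sample.
means = np.array([-0.5, 0.5])
counts = np.array([10.0, 10.0])

for _ in range(5000):
    # An unlabelled point arrives from the mixture.
    c = rng.integers(2)
    x = rng.normal(true_means[c], true_std)

    # Step 1: label it with the current classifier (nearest mean, which is
    # the Gaussian classifier under equal priors and equal variances).
    y_hat = int(abs(x - means[1]) < abs(x - means[0]))

    # Step 2: add it to the training set with the assigned label and update
    # the classifier before the next point (incremental mean update).
    counts[y_hat] += 1
    means[y_hat] += (x - means[y_hat]) / counts[y_hat]

print("estimated class means:", means)
```

Under this correctly specified model the estimated means tend to drift towards the true values (-1, 1), in line with the abstract's claim; when the assumed model does not match the data, the same self-labelling loop can run away from the optimal parameters.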

Related Topics
Physical Sciences and Engineering › Computer Science › Computer Vision and Pattern Recognition
Authors