Article ID: 4969680
Journal: Pattern Recognition
Published Year: 2017
Pages: 33
File Type: PDF
Abstract
Cross-validation (CV) is often used to estimate the generalization capability of a learning model. The variance of the CV error has a considerable impact on the accuracy of the CV estimator and on the adequacy of the learning model, so analyzing the CV variance is important. The aim of this paper is to investigate how to improve the accuracy of error estimation through variance analysis. We first describe the quantitative relationship between the CV variance and the accuracy of the estimator, which provides guidance for improving accuracy by reducing variance. We then study the relationships between the variance and relevant variables, including the sample size, the number of folds, and the number of repetitions; these form the basis of theoretical strategies for regulating the CV variance. Our classification results theoretically explain the empirical findings of Rodríguez and Kohavi. Finally, we propose a uniform normalized variance that not only measures model accuracy but is also independent of the number of folds. It therefore simplifies the choice of the number of folds in k-fold CV, and the normalized variance can serve as a stable error measurement for model comparison and selection. We report the results of experiments using 5 supervised learning models and 20 datasets. The results indicate that the proposed theorems reliably determine, before running k-fold CV, which variance is smaller, so the accuracy of error estimation can be improved by reducing the variance. In so doing, we are more likely to select the best parameter or model.
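To make the quantity under study concrete, the following is a minimal sketch (not the authors' implementation) of how the variance of the k-fold CV error can be estimated empirically by repeating CV with reshuffled folds. The dataset, classifier, k = 10, and 30 repetitions are illustrative assumptions using scikit-learn, not choices taken from the paper.

```python
# Minimal sketch: empirical mean and variance of the k-fold CV error
# across repetitions. All concrete choices below are illustrative.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

k, n_repetitions = 10, 30
cv_errors = []
for rep in range(n_repetitions):
    # Reshuffle the folds on each repetition so every run uses a
    # different partition, giving a sample of the CV error distribution.
    cv = KFold(n_splits=k, shuffle=True, random_state=rep)
    scores = cross_val_score(model, X, y, cv=cv)  # per-fold accuracies
    cv_errors.append(1.0 - scores.mean())         # one k-fold CV error estimate

cv_errors = np.array(cv_errors)
print(f"mean CV error over {n_repetitions} repetitions: {cv_errors.mean():.4f}")
print(f"variance of the CV error estimate: {cv_errors.var(ddof=1):.6f}")
```

The variance printed here is the quantity whose dependence on sample size, fold number, and repetition count the paper analyzes; a smaller value indicates a more stable, and hence more trustworthy, error estimate.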
Related Topics
Physical Sciences and Engineering · Computer Science · Computer Vision and Pattern Recognition