Article ID: 461121
Journal: Journal of Systems and Software
Published Year: 2013
Pages: 12 Pages
File Type: PDF
Abstract

Context: More than half the literature on software effort estimation (SEE) focuses on model comparisons. Each of these comparisons requires a sampling method (SM) to generate the train and test sets. Different authors use different SMs, such as leave-one-out (LOO), 3Way and 10Way cross-validation. While LOO is a deterministic algorithm, the N-way methods use random selection to build their train and test sets. This introduces the problem of conclusion instability, where different authors rank effort estimators in different ways.
Objective: To reduce conclusion instability by removing the effects of a sampling method's random test case generation.
Method: Calculate bias and variance (B&V) values under the assumption that a learner trained on the whole dataset is the true model; then demonstrate that the B&V and runtime values for LOO are similar to those of the N-way methods by running 90 different algorithms on 20 different SEE datasets. For each algorithm, collect runtimes and B&V values under LOO, 3Way and 10Way.
Results: We observed that: (1) the majority of the algorithms have statistically indistinguishable B&V values under different SMs and (2) different SMs have similar runtimes.
Conclusion: In terms of their generated B&V values and runtimes, there is no reason to prefer N-way over LOO. In terms of reproducibility, LOO removes one cause of conclusion instability (the random selection of train and test sets). Therefore, we deprecate N-way and endorse LOO validation for assessing effort models.
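The following is a minimal sketch (not the authors' code) of the comparison described in the Method section: it contrasts LOO with randomized 3Way and 10Way cross-validation on a single learner, measuring runtime and B&V. The learner (scikit-learn's LinearRegression), the synthetic dataset, and the exact bias/variance formulas are all illustrative assumptions; the paper's only stated premise reproduced here is that the model trained on the whole dataset is taken as the true model.

    # Sketch: compare LOO vs. N-way sampling methods for one learner (assumptions noted above).
    import time
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, KFold

    X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

    # "True model": a learner trained on the whole dataset, per the paper's assumption.
    true_pred = LinearRegression().fit(X, y).predict(X)

    def bias_variance(splitter):
        """Collect out-of-sample predictions, then measure bias against the
        whole-dataset model's predictions and the variance of those predictions
        (assumed B&V definitions for illustration only)."""
        preds = np.empty_like(y, dtype=float)
        start = time.perf_counter()
        for train_idx, test_idx in splitter.split(X):
            model = LinearRegression().fit(X[train_idx], y[train_idx])
            preds[test_idx] = model.predict(X[test_idx])
        runtime = time.perf_counter() - start
        bias = np.mean((preds - true_pred) ** 2)   # assumed squared-bias definition
        variance = np.var(preds)                   # assumed variance definition
        return bias, variance, runtime

    # LOO is deterministic; the N-way splitters rely on random shuffling,
    # which is the source of conclusion instability discussed in the abstract.
    for name, sm in [("LOO", LeaveOneOut()),
                     ("3Way", KFold(n_splits=3, shuffle=True, random_state=1)),
                     ("10Way", KFold(n_splits=10, shuffle=True, random_state=1))]:
        b, v, t = bias_variance(sm)
        print(f"{name:5s} bias={b:8.2f} variance={v:10.2f} runtime={t:.3f}s")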
