Article ID: 461121
Journal: Journal of Systems and Software
Published Year: 2013
Pages: 12 Pages
File Type: PDF
Abstract

Context: More than half the literature on software effort estimation (SEE) focuses on model comparisons. Each of these comparisons requires a sampling method (SM) to generate the train and test sets. Different authors use different SMs, such as leave-one-out (LOO), 3Way and 10Way cross-validation. While LOO is a deterministic algorithm, the N-way methods use random selection to build their train and test sets. This introduces the problem of conclusion instability, where different authors rank effort estimators in different ways.
Objective: To reduce conclusion instability by removing the effects of a sampling method's random test case generation.
Method: Calculate bias and variance (B&V) values under the assumption that a learner trained on the whole dataset is the true model; then demonstrate that the B&V and runtime values for LOO are similar to those of the N-way methods by running 90 different algorithms on 20 different SEE datasets. For each algorithm, collect runtimes and B&V values under LOO, 3Way and 10Way.
Results: We observed that: (1) the majority of the algorithms have statistically indistinguishable B&V values under different SMs and (2) different SMs have similar runtimes.
Conclusion: In terms of their generated B&V values and runtimes, there is no reason to prefer N-way over LOO. In terms of reproducibility, LOO removes one cause of conclusion instability (the random selection of train and test sets). Therefore, we deprecate N-way and endorse LOO validation for assessing effort models.
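The following is a minimal sketch (not the authors' code) of the comparison described in the Method section: it contrasts LOO with randomized 3Way and 10Way cross-validation on a single learner, measuring runtime and B&V. The learner (scikit-learn's LinearRegression), the synthetic dataset, and the exact bias/variance formulas are all illustrative assumptions; the paper's only stated premise reproduced here is that the model trained on the whole dataset is taken as the true model.

    # Sketch: compare LOO vs. N-way sampling methods for one learner (assumptions noted above).
    import time
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, KFold

    X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

    # "True model": a learner trained on the whole dataset, per the paper's assumption.
    true_pred = LinearRegression().fit(X, y).predict(X)

    def bias_variance(splitter):
        """Collect out-of-sample predictions, then measure bias against the
        whole-dataset model's predictions and the variance of those predictions
        (assumed B&V definitions for illustration only)."""
        preds = np.empty_like(y, dtype=float)
        start = time.perf_counter()
        for train_idx, test_idx in splitter.split(X):
            model = LinearRegression().fit(X[train_idx], y[train_idx])
            preds[test_idx] = model.predict(X[test_idx])
        runtime = time.perf_counter() - start
        bias = np.mean((preds - true_pred) ** 2)   # assumed squared-bias definition
        variance = np.var(preds)                   # assumed variance definition
        return bias, variance, runtime

    # LOO is deterministic; the N-way splitters rely on random shuffling,
    # which is the source of conclusion instability discussed in the abstract.
    for name, sm in [("LOO", LeaveOneOut()),
                     ("3Way", KFold(n_splits=3, shuffle=True, random_state=1)),
                     ("10Way", KFold(n_splits=10, shuffle=True, random_state=1))]:
        b, v, t = bias_variance(sm)
        print(f"{name:5s} bias={b:8.2f} variance={v:10.2f} runtime={t:.3f}s")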
