Article code | Journal code | Publication year | English article | Full-text version
10150991 | 1666104 | 2018 | 52-page PDF | Free download
English title of the ISI article
Handling missing values: A study of popular imputation packages in R
Related topics
Engineering and Basic Sciences; Computer Engineering; Artificial Intelligence
English abstract
In the real world, data are often plagued by missing values, which adversely affect the final outcome of any analysis based on such data. Missing values can be handled using techniques such as deletion or imputation. Of late, R has become one of the most preferred platforms for data analysis, and its popularity continues to grow. R provides various packages for handling missing values through imputation. The presence of multiple packages, however, calls for an analysis of their comparative performance and an examination of their suitability for handling a given dataset. The performance of different R packages may differ across datasets and may depend on the size of the dataset and the extent of the missing values in it. In this paper, the authors perform a comparative study of the performance of the common R packages used for missing value imputation, namely VIM, MICE, MissForest, and HMISC. The authors measured the performance of these packages in terms of imputation time, imputation efficiency, and the effect on variance. Imputation efficiency was measured as the difference in predictive performance between a model built on the original dataset and a model built on a dataset with imputed values. Similarly, the variance of the variables in the original dataset was compared with that of the corresponding variables in the imputed dataset. A missing value imputation package can be considered better if it consumes less imputation time and provides high imputation accuracy. In terms of variance, one would also like the imputation package to maintain the original variance of the variables. On analysing the four imputation packages on two datasets with three predictive algorithms (Logistic Regression, Support Vector Machines, and Artificial Neural Networks), it was observed that performance varies depending on the size of the dataset and the missing values present in it.
The study highlights that certain missing value imputation packages, used in conjunction with a given predictive algorithm, provide better performance, which is again a function of the dataset characteristics.
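To make two of the evaluation criteria named in the abstract concrete (imputation time and the effect on variance), here is a minimal Python sketch. It is not the authors' code and does not call any of the R packages. The helper names (`mask_values`, `mean_impute`, `evaluate_imputer`), the synthetic Gaussian column, the 20% missing rate, and the use of simple mean imputation as a stand-in imputer are all assumptions made for illustration. The predictive-performance comparison is omitted.

```python
import random
import statistics
import time


def mask_values(column, missing_rate, seed=0):
    """Return a copy of `column` with a fraction `missing_rate` of entries set to None."""
    rng = random.Random(seed)
    n_missing = int(round(missing_rate * len(column)))
    missing_idx = set(rng.sample(range(len(column)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(column)]


def mean_impute(column):
    """Hypothetical stand-in imputer: fill each None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]


def evaluate_imputer(original, imputer, missing_rate=0.2, seed=0):
    """Time one imputation run and report the variance before and after imputation."""
    masked = mask_values(original, missing_rate, seed)
    start = time.perf_counter()
    imputed = imputer(masked)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "n_imputed": sum(v is None for v in masked),
        "variance_original": statistics.variance(original),
        "variance_imputed": statistics.variance(imputed),
    }


# Synthetic data (an assumption; the paper's two datasets are not reproduced here).
data_rng = random.Random(42)
original = [data_rng.gauss(50.0, 10.0) for _ in range(1000)]

report = evaluate_imputer(original, mean_impute)
print(report)
```

With mean imputation the imputed column's variance cannot exceed the original's, because every filled-in value sits exactly at the observed mean. A variance comparison of this kind is meant to expose that sort of shrinkage. Any other imputer with the same list-in, list-out shape could be passed to `evaluate_imputer` in place of `mean_impute`.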
Publisher
Database: Elsevier - ScienceDirect
Journal: Knowledge-Based Systems - Volume 160, 15 November 2018, Pages 104-118