An embedded imputation method via Attribute-based Decision Graphs

Article ID	Journal	Published Year	Pages	File Type
381970	Expert Systems with Applications	2016	19 Pages	PDF

Abstract

• Attribute-based Decision Graphs represent the correlation among data attributes.
• Similar data instances induce similar subgraphs in the AbDG.
• Imputation partially matches instances to the AbDG searching for a proper subgraph.
• The method has low computational costs and handles high rates of missing values.
• Results show the method is efficient to impute data prior to classification tasks.

The performance of classification algorithms is highly dependent on the quality of the training data. Missing attribute values are common in many real-world applications; in such cases, a complementary method is needed to improve the quality of the data and, consequently, the performance of the classifier. Two strategies are commonly employed in practice to deal with this problem: (1) multiple imputation, which often maintains the statistical properties of the original data and usually performs well, at the expense of high computational cost; and (2) single imputation, which generally provides a suitable solution for data sets with few missing attribute values but rarely achieves good results when the number of missing values is high. This paper proposes a new single imputation method which uses Attribute-based Decision Graphs (AbDGs) to estimate the missing values. AbDGs are a new type of data graph that embeds the information contained in the training set into a graph structure, built over pre-defined intervals of values from different attributes. As a consequence, similar data instances induce similar subgraphs when projected onto the AbDG, resulting in distinct patterns of connections. The main contribution of the paper is a well-defined procedure to perform imputation by partially matching instances with missing values against the AbDG. The proposed imputation method can effectively deal with data sets having high rates of missing attribute values while presenting low computational cost, a significant result towards the development of robust expert and intelligent systems. The results provide evidence that the proposed method is sound and promotes high-quality imputation for classification purposes.
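
The general idea of matching a partially observed instance against a graph built over attribute intervals can be illustrated with a minimal Python sketch. This is not the AbDG construction or the matching procedure defined in the paper: the equal-width intervals, the co-occurrence counts used as edge weights, the interval-midpoint estimate, and all function names (`make_bins`, `bin_index`, `build_graph`, `impute`) are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def make_bins(values, n_bins):
    """Equal-width interval edges over one attribute (an assumed discretization)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant attribute
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(edges, x):
    """Index of the interval containing x, clamped to the valid range."""
    n_bins = len(edges) - 1
    width = edges[1] - edges[0]
    return max(0, min(n_bins - 1, int((x - edges[0]) / width)))


def build_graph(rows, n_bins=3):
    """Vertices are (attribute, interval) pairs; an edge weight counts how often
    two intervals of different attributes co-occur in a training instance."""
    n_attrs = len(rows[0])
    edges = [
        make_bins([r[a] for r in rows if r[a] is not None], n_bins)
        for a in range(n_attrs)
    ]
    weights = defaultdict(int)
    for r in rows:
        observed = [
            (a, bin_index(edges[a], r[a]))
            for a in range(n_attrs)
            if r[a] is not None
        ]
        for u, v in combinations(observed, 2):
            weights[(u, v)] += 1
            weights[(v, u)] += 1
    return edges, weights


def impute(row, edges, weights):
    """Fill each missing value (None) with the midpoint of the interval that is
    most strongly connected to the intervals of the observed attributes."""
    observed = [
        (a, bin_index(edges[a], x)) for a, x in enumerate(row) if x is not None
    ]
    filled = list(row)
    for a, x in enumerate(row):
        if x is not None:
            continue
        n_bins = len(edges[a]) - 1
        # Score every candidate interval by its total edge weight to the observed part.
        best = max(
            range(n_bins),
            key=lambda k: sum(weights.get(((a, k), o), 0) for o in observed),
        )
        filled[a] = (edges[a][best] + edges[a][best + 1]) / 2
    return filled


# Toy training set with two correlated attributes (invented data).
train = [[1.0, 10.0], [1.2, 11.0], [5.0, 50.0], [5.2, 52.0]]
bin_edges, edge_weights = build_graph(train, n_bins=2)
print(impute([5.1, None], bin_edges, edge_weights))
```

Because the graph is built once from the training set and each imputation only scores a handful of candidate intervals, a scheme of this kind stays cheap even when many values are missing, which is the property the abstract emphasizes.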

Keywords

Single imputation; Data imputation