Article ID: 6856190
Journal: Information Sciences
Published Year: 2018
Pages: 18
File Type: PDF
Abstract
Missing attribute values are prevalent in real relational data, especially in data extracted from the Web. Imputing them accurately is important for ensuring high-quality data analytics. Although many techniques have been proposed for this task, none of them provides a flexible mechanism for quality control. Without a quality guarantee, many missing entries may be filled with wrong values, which can easily bias data analysis. In this paper, we first propose a novel probabilistic framework based on the concept of Generalized Feature Dependency (GFD). By exploiting the monotonicity between imputation precision and match probability, the framework enables a flexible mechanism for quality control. We then present the imputation model with a precision guarantee and the techniques to maximize recall while meeting a user-specified precision requirement. Finally, we evaluate the performance of the proposed approach on real data. Our extensive experiments show that it outperforms the state-of-the-art alternatives and, most importantly, that its quality control mechanism is effective.
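The idea of trading recall against a user-specified precision requirement can be illustrated with a minimal sketch. It is not the paper's GFD model: it assumes, hypothetically, that each candidate imputation already carries a match probability equal to its chance of being correct, so the expected precision of an accepted set is the mean of its probabilities. The function name `select_imputations` and the input format are illustrative assumptions.

```python
from typing import List, Tuple


def select_imputations(
    candidates: List[Tuple[str, float]], min_precision: float
) -> List[Tuple[str, float]]:
    """Accept as many candidates as possible while keeping the expected
    precision (mean match probability, by assumption) at or above the target."""
    # Rank candidates from most to least probable.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    accepted: List[Tuple[str, float]] = []
    prob_sum = 0.0
    for value, prob in ranked:
        # The running mean of a descending sequence never increases,
        # so the first violation marks the largest acceptable prefix.
        if (prob_sum + prob) / (len(accepted) + 1) < min_precision:
            break
        accepted.append((value, prob))
        prob_sum += prob
    return accepted


# Hypothetical candidates: (imputed value, match probability).
candidates = [("a", 0.95), ("b", 0.90), ("c", 0.60), ("d", 0.30)]
chosen = select_imputations(candidates, min_precision=0.8)
print([value for value, _ in chosen])
```

Because the running mean only decreases as lower-probability candidates are added, raising the precision requirement can only shrink the accepted set, which is the kind of monotonic relationship that makes a single threshold usable as a quality control knob.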
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence