کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
6885370 1444510 2018 20 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
Heuristic-based approaches for speeding up incremental record linkage
ترجمه فارسی عنوان
رویکردهای مبتنی بر اکتشافی برای سرعت بخشیدن به پیوند پیوسته افزایشی
کلمات کلیدی
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر شبکه های کامپیوتری و ارتباطات
چکیده انگلیسی
Record Linkage is the task of processing a dataset in order to identify which records refer to the same real world entity. The intrinsic complexity of this task brings many challenges to traditional or naive approaches, especially in contexts such as Big Data, unstructured data and frequent data increments over the dataset. To deal with these contexts, especially the latter, an incremental record linkage approach may be employed in order to avoid (re)processing the entire dataset to update the deduplication results. For doing so, different classification techniques can be employed to identify duplicate entities. Recently, many algorithms have been proposed to combine collective classification, which employs clustering algorithms, together with the incremental principle. In this article, we propose new metrics for incremental record linkage using collective classification and new heuristics (which combine clustering, coverage component filters and a greedy approach) to speed up even more a solution to incremental record linkage. These heuristics have been evaluated using three different scale datasets and the results were analyzed and discussed based on both classical and the newly proposed metrics. The experiments present different trade-offs, regarding efficacy and efficiency results, which are generated by the considered heuristics. Also, the results indicate that, for large and frequent data increments, it is possible to slightly reduce efficacy results by employing a coverage filter-based heuristic that is reasonably faster than the current state-of-the-art approach. In turn, it is also possible to employ single-pass clustering algorithms, which are able to execute significantly faster than the state-of-the-art approach at the cost of sacrificing precision results.
ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Systems and Software - Volume 137, March 2018, Pages 335-354
نویسندگان
, , ,