A strategy for allowing meaningful and comparable scores in approximate matching

Article code	Journal code	Publication year	English article	Full-text version
397518	671267	2009	17-page PDF	Free download


Keywords

Deduplication; Data cleaning; Entity resolution; Data integration

Related topics

Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence


Abstract

Approximate data matching aims at assessing whether two distinct instances of data represent the same real-world object. The comparison between data values is usually done by applying a similarity function which returns a similarity score. If this score surpasses a given threshold, both data instances are considered as representing the same real-world object. These score values depend on the algorithm that implements the function and have no meaning to the user. In addition, score values generated by different functions are not comparable. This will potentially lead to problems when the scores returned by different similarity functions need to be combined for computing the similarity between records. In this article, we propose that thresholds should be defined in terms of the precision that is expected from the matching process rather than in terms of the raw scores returned by the similarity function. Precision is a widely known similarity metric and has a clear interpretation from the user's point of view. Our approach defines mappings from score values to precision values, which we call adjusted scores. In order to obtain such mappings, our approach requires training over a small dataset. Experiments show that training can be reused for different datasets on the same domain. Our results also demonstrate that existing methods for combining scores for computing the similarity between records may be enhanced if adjusted scores are used.
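The idea of mapping raw similarity scores to precision values can be illustrated with a minimal sketch. The abstract does not describe the actual training procedure, so everything below is an assumption made for illustration: the function names `fit_precision_map` and `adjusted_score` are hypothetical, and the mapping simply records, for each raw score seen in a small labeled training set, the fraction of true matches among pairs scoring at or above it.

```python
from bisect import bisect_left


def fit_precision_map(scored_pairs):
    """Build a (threshold, precision) table from labeled training pairs.

    scored_pairs: iterable of (raw_score, is_match) tuples.
    Assumption: precision at threshold t is the share of true matches
    among all training pairs whose raw score is >= t.
    """
    pairs = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    table = []
    seen = matches = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every pair tied at this raw score before recording precision.
        while i < len(pairs) and pairs[i][0] == threshold:
            seen += 1
            matches += bool(pairs[i][1])
            i += 1
        table.append((threshold, matches / seen))
    table.reverse()  # ascending by threshold, so bisect can be used
    return table


def adjusted_score(table, raw_score):
    """Map a raw score to the precision observed when thresholding at it."""
    thresholds = [t for t, _ in table]
    # Pairs scoring >= raw_score are exactly those at or above the
    # smallest training threshold that is >= raw_score.
    idx = bisect_left(thresholds, raw_score)
    if idx == len(table):
        idx = len(table) - 1  # above every training score: reuse the top entry
    return table[idx][1]


# Hypothetical training data: (raw score, is the pair a true match?)
training = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False)]
table = fit_precision_map(training)
print(adjusted_score(table, 0.65))
```

Because the adjusted value is a precision, a threshold such as 0.9 reads as "at least 90% of the accepted pairs are expected to be true matches", and adjusted scores from different similarity functions share that scale. Note that precision computed this way is not guaranteed to increase with the raw score, so a practical mapping would probably need smoothing or interpolation.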

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Systems - Volume 34, Issue 8, December 2009, Pages 673–689

Authors

Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura
