Article code: 4956475
Journal code: 1444519
Publication year: 2017
English article: 25 pages, PDF
Full-text version: free download
English title of the ISI article
An efficient spark-based adaptive windowing for entity matching
Persian translation of the title
یک پوشش سازگار مبتنی بر جرقه مناسب برای سازگاری سازمانی
Keywords
Adaptive windowing, entity matching, load balancing, Spark
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Networks and Communications
English abstract
Entity Matching (EM), i.e., the task of identifying records that refer to the same entity, is a fundamental problem in every information integration and data cleansing system, e.g., to find similar product descriptions in databases. Because of its pair-wise nature, the EM task is known to be challenging when the datasets involved in the matching process have a high volume. For this reason, studies of the challenges and possible solutions of how EM can benefit from modern parallel computing programming models, such as Apache Spark (Spark), have become increasingly important (Christen, 2012a; Kolb et al., 2012b). The effectiveness and scalability of Spark-based implementations for EM depend on how well the workload distribution is balanced among all workers. In this article, we investigate how Spark can be used to perform efficient (load-balanced) parallel EM using a variation of the Sorted Neighborhood Method (SNM) that uses a varying (adaptive) window size. We propose Spark Duplicate Count Strategy (S-DCS++), a Spark-based approach for adaptive SNM, aiming to further increase the performance of this method. The evaluation results, based on real-world datasets and cluster infrastructure, show that our approach outperforms parallel DCS++ in terms of EM execution time.
Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Systems and Software - Volume 128, June 2017, Pages 1-10