Sectional MinHash for near-duplicate detection

Article code	Journal code	Publication year	English article	Full-text version
6855154	1437607	2018	18-page PDF	Free download

Keywords

Locality sensitive hashing

Related topics

Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence

Abstract

MinHash is a widely used method for efficiently estimating the similarity between documents for Near-Duplicate Detection (NDD). However, it is based on the concept of set resemblance rather than near-duplication. This study proposes Sectional MinHash (S-MinHash), a method designed specifically for detecting near-duplicate documents. The proposed method enhances the MinHash data structure with information about the location of the attributes in the document. It provides an unbiased estimate of the Jaccard coefficient with a smaller variance than MinHash for the same signature size. The experimental results showed that the Mean Squared Error (MSE) of the proposed method was around one eighth of the MSE of MinHash. Document NDD with the proposed method was also more accurate than with MinHash and a recent method, BitHash. The best F-measure obtained was 87.05%. Setting the number of sections s to 2 gave the best results for the tested dataset.
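
For orientation, the following is a minimal sketch of the baseline MinHash estimator that the abstract compares against. It is not S-MinHash: the abstract does not describe how the sectional signature is built, so that part is left out. The function names, the word-shingle size, and the use of seeded BLAKE2b as the hash family are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an illustrative choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash64(item, seed):
    """Seeded 64-bit hash; stands in for one member of a hash family (assumption)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Signature = the minimum hash value of the set under each seeded hash."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash64(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard coefficient."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Two documents would be compared by calling `minhash_signature(shingles(doc))` on each and passing both signatures to `estimate_jaccard`. The variance of this estimate shrinks as `num_hashes` grows, which is the signature-size trade-off the abstract refers to when it compares variances at equal signature sizes.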

Publisher

Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 99, 1 June 2018, Pages 203-212

Authors

Roya Hassanian-esfahani, Mohammad-javad Kargar
