Article code | Journal code | Publication year | English article | Full text |
---|---|---|---|---|
6855154 | 1437607 | 2018 | 18-page PDF | Free download |
English title of the ISI article
Sectional MinHash for near-duplicate detection
Related topics
Engineering and basic sciences
Computer engineering
Artificial intelligence
Abstract
MinHash is a widely used method for efficiently estimating the similarity between documents in Near-Duplicate Detection (NDD). However, it is based on the concept of set resemblance rather than near-duplication. In this study, Sectional MinHash (S-MinHash), designed specifically for detecting near-duplicate documents, is proposed. The proposed method enhances the MinHash data structure with information about the location of the attributes in the document. It provides an unbiased estimate of the Jaccard coefficient with a smaller variance than MinHash for the same signature size. The experimental results showed that the Mean Squared Error (MSE) of the proposed method was around one eighth of the MSE of MinHash. In addition, NDD with the proposed method was more accurate than with MinHash and the recent BitHash method. The best F-measure obtained was 87.05%. Setting the number of sections s to 2 gave the best results for the tested dataset.
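For readers unfamiliar with the baseline, the following is a minimal sketch of plain MinHash, not of S-MinHash, whose sectional construction is not described in the abstract. It assumes documents are modeled as sets of word shingles and uses salted BLAKE2b as a stand-in for a family of independent hash functions; the function names (`shingles`, `minhash_signature`, `estimate_jaccard`) and the parameters are illustrative choices, not taken from the paper. With k independent hash functions, the fraction of matching signature positions is an unbiased estimate of the Jaccard coefficient J, with variance J(1 - J)/k.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (an assumed attribute model)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """64-bit hash of `item` under the seed-th hash function (salted BLAKE2b)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """Signature = minimum hash value of the set under each hash function."""
    if not items:
        raise ValueError("cannot build a MinHash signature of an empty set")
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two non-empty sets."""
    return len(a & b) / len(a | b)
```

A typical use would be to compute `minhash_signature(shingles(doc))` once per document and compare signatures with `estimate_jaccard`, flagging pairs above a chosen threshold as near-duplicate candidates. The threshold and the signature length are tuning choices that this sketch leaves open.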
Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 99, 1 June 2018, Pages 203-212
Authors
Roya Hassanian-esfahani, Mohammad-javad Kargar