Article code: 384659
Journal code: 660853
Year of publication: 2013
English article: 10 pages, PDF
Full-text version: free download
English title of the ISI article
Detecting near-duplicate documents using sentence-level features and supervised learning
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract

We present a novel method for detecting near-duplicates in a large collection of documents. The method has three major parts: feature selection, similarity measure, and discriminant derivation. To find near-duplicates of an input document, each sentence of the input document is fetched and preprocessed, the weight of each term is calculated, and the heavily weighted terms are selected as the feature of the sentence. As a result, the input document is turned into a set of such features. A similarity measure is then applied to compute the similarity degree between the input document and each document in the given collection. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set, and this function is then used to decide whether a document is a near-duplicate of the input document based on the similarity degree between them. The sentence-level features we adopt can better reveal the characteristics of a document. In addition, learning the discriminant function with an SVM avoids the trial-and-error effort required by conventional methods. Experimental results show that our method is effective in near-duplicate document detection.
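The feature-selection and similarity steps described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the term-weighting formula or the similarity measure, so the sketch assumes a smoothed TF-IDF weight and a Jaccard similarity over sets of sentence features. The stopword list, the `top_k` parameter, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; the source does not specify one.
STOPWORDS = {"a", "an", "the", "of", "is", "in", "and", "to", "on"}


def split_sentences(text):
    """Split text into sentences on ., ! and ? (a deliberately naive splitter)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    """Lowercase, keep alphanumeric tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def document_frequencies(collection):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))
    return df


def sentence_features(text, df, n_docs, top_k=3):
    """Turn a document into a set of sentence-level features.

    Each feature is the set of the top_k most heavily weighted terms of one
    sentence. The weight is a smoothed TF-IDF (an assumption, see lead-in).
    """
    features = set()
    for sentence in split_sentences(text):
        tf = Counter(tokenize(sentence))
        if not tf:
            continue
        weight = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        }
        # Sort by descending weight; break ties alphabetically for determinism.
        heavy = sorted(weight, key=lambda t: (-weight[t], t))[:top_k]
        features.add(frozenset(heavy))
    return features


def similarity(features_a, features_b):
    """Jaccard similarity between two sets of sentence features (an assumption)."""
    if not features_a and not features_b:
        return 1.0
    return len(features_a & features_b) / len(features_a | features_b)


# Hypothetical toy collection, for illustration only.
collection = [
    "The cat sat on the mat. Dogs chase cats in the park.",
    "The cat sat on the mat. Dogs chase cats in the park. Copyright 2013.",
    "Stock markets fell sharply today. Investors fear rising rates.",
]
df = document_frequencies(collection)
feature_sets = [sentence_features(doc, df, len(collection)) for doc in collection]
degree_near = similarity(feature_sets[0], feature_sets[1])
degree_far = similarity(feature_sets[0], feature_sets[2])
```

In this toy collection the second document only appends a short sentence to the first, so the two share most of their sentence features, while the third document shares none.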


► A novel method for detecting near-duplicates from a large collection of documents is presented.
► The sentence-level features adopted can better reveal the characteristics of a document.
► A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set.
► Learning the discriminant function by SVM can avoid trial-and-error efforts required in conventional methods.
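To illustrate the last two highlights, the sketch below learns a discriminant on the similarity degree instead of hand-tuning a threshold. It is a minimal example under assumptions, not the paper's setup: the abstract does not state the kernel, solver, or training data, so this uses a one-dimensional linear SVM trained by plain subgradient descent on the regularized hinge loss, and the training pairs, hyperparameters, and function names are hypothetical.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=5000):
    """Fit f(x) = w * x + b by full-batch subgradient descent on
    (lam / 2) * w**2 + mean(max(0, 1 - y * f(x))).

    xs are similarity degrees, ys are labels in {+1, -1}.
    lam, lr, and epochs are illustrative values, not taken from the source.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = lam * w, 0.0
        for x, y in zip(xs, ys):
            # Only points inside the margin contribute to the hinge subgradient.
            if y * (w * x + b) < 1.0:
                grad_w -= y * x / n
                grad_b -= y / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def is_near_duplicate(similarity_degree, w, b):
    """Apply the learned discriminant to one similarity degree."""
    return w * similarity_degree + b > 0.0


# Hypothetical training patterns: similarity degrees of labeled document pairs,
# +1 for near-duplicate pairs and -1 for unrelated pairs.
train_x = [0.95, 0.90, 0.85, 0.80, 0.30, 0.20, 0.10, 0.05]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)
learned_threshold = -b / w  # similarity degree at which the decision flips
```

The decision boundary falls out of training, which is the point the highlights make: the cut-off between near-duplicates and other documents is learned from labeled patterns, not found by trial and error.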

Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 40, Issue 5, April 2013, Pages 1467–1476