Accurate discovery of co-derivative documents via duplicate text detection

Article code	Journal code	Publication year	English article	Full-text version
397442	671198	2006	15 pages, PDF	Free download

Keywords

Fingerprinting; Hashing; Duplicate detection

Related subjects

Engineering and Basic Sciences; Computer Engineering; Artificial Intelligence

Abstract

Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other, or some portion of both must be derived from a third document. An existing technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is hampered by an inability to accurately isolate information that is useful in identifying co-derivatives. In this paper we present spex, a novel hash-based algorithm for extracting duplicated chunks from a document collection. We discuss how information about shared chunks can be used for efficiently and reliably identifying co-derivative clusters, and describe deco, a prototype package that combines the spex algorithm with other optimisations and compressed indexing to produce a flexible and scalable co-derivative discovery system. Our experiments with multi-gigabyte document collections demonstrate the effectiveness of the approach.
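The chunk-hashing idea behind document fingerprinting can be illustrated with a short sketch. This is not the spex algorithm or the deco system, whose details the abstract does not give; it is a simplified illustration that hashes every chunk and keeps only chunks shared by at least two documents. The function names, the choice of word 3-grams as chunks, the MD5 hash, and the `min_shared` threshold are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def chunks(text, n=3):
    """Yield every run of n consecutive words (an assumed chunk definition)."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


def chunk_hash(chunk):
    """Hash a chunk to a fixed-size value; MD5 is an arbitrary choice here."""
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()


def shared_chunks(docs, n=3):
    """Map each chunk hash found in two or more documents to those documents."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for c in chunks(text, n):
            postings[chunk_hash(c)].add(doc_id)
    # Chunks unique to one document carry no co-derivation evidence, so drop them.
    return {h: ids for h, ids in postings.items() if len(ids) > 1}


def co_derivative_pairs(docs, n=3, min_shared=2):
    """Count shared chunks per document pair; keep pairs at or above min_shared."""
    counts = defaultdict(int)
    for ids in shared_chunks(docs, n).values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return {pair: k for pair, k in counts.items() if k >= min_shared}
```

Calling `co_derivative_pairs` on a dictionary that maps document identifiers to their text returns the candidate pairs together with the number of chunks they share. A naive version like this stores every chunk of every document, which is the cost that selective fingerprinting and duplicate-chunk extraction aim to reduce on large collections.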

Publisher

Database: Elsevier - ScienceDirect
Journal: Information Systems - Volume 31, Issue 7, November 2006, Pages 595–609

Authors

Yaniv Bernstein, Justin Zobel
