XNDDF: Towards a Framework for Flexible Near-Duplicate Document Detection Using Supervised and Unsupervised Learning

Article ID	Journal	Published Year	Pages	File Type
489958	Procedia Computer Science	2015	8 Pages	PDF

Abstract

The WWW has witnessed exponential growth in the number of web documents. People from all walks of life depend on the Internet, the electronic superhighway, to retrieve information, and search engines perform that retrieval. Detecting and handling near-duplicate documents can help search engines improve their performance. In this paper, we propose two algorithms. The first performs unsupervised probabilistic clustering of documents, while the second detects near duplicates and can be used in the offline processing of search engines. Clustering the documents avoids unnecessary comparisons, while the near-duplicate detection algorithm performs local feature selection in a given document based on weights assigned to terms. A classifier is built with supervised learning to discriminate between documents. We propose a framework named eXtensible Near Duplicate Detection Framework (XNDDF), whose components leave room for flexible duplicate detection solutions and which shows the offline and online processing required by a search engine. Our future work is to implement the framework components through a prototype application.
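
The abstract does not spell out the two algorithms, so the following is only a minimal sketch of the general idea: select the highest-weighted terms of each document as local features, then compare only documents that fall in the same cluster. Everything specific here is an assumption rather than the paper's method: TF-IDF as the term-weighting scheme, Jaccard overlap as the similarity measure, the threshold value, the function names, and the cluster labels, which are simply passed in instead of being produced by a probabilistic clustering step.

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w]


def top_terms(docs, k=5):
    """Return, per document, the set of its k highest-weighted terms.

    Assumption: weights are smoothed TF-IDF scores; the paper may use
    a different weighting scheme.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    selected = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()
        }
        # Highest weight first; alphabetical order breaks ties deterministically.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        selected.append(set(ranked[:k]))
    return selected


def jaccard(a, b):
    """Jaccard overlap of two term sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, clusters, k=5, threshold=0.6):
    """Return index pairs (i, j) flagged as near duplicates.

    `clusters` holds one hypothetical cluster label per document; pairs in
    different clusters are skipped, which is where the comparisons are saved.
    """
    feats = top_terms(docs, k)
    pairs = []
    for i, j in combinations(range(len(docs)), 2):
        if clusters[i] != clusters[j]:
            continue
        if jaccard(feats[i], feats[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```

A call such as `near_duplicates(docs, clusters, k=20, threshold=0.6)` returns the candidate pairs for a small corpus. In an offline setting these pairs could be computed once at indexing time, so that the online query path only has to filter them out.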