Article ID: 489958
Journal: Procedia Computer Science
Published Year: 2015
Pages: 8
File Type: PDF
Abstract

The World Wide Web has seen exponential growth in the number of web documents, and people from all walks of life depend on the Internet to retrieve information. Search engines perform this retrieval, and detecting and handling near-duplicate documents can help them improve performance. In this paper, we propose two algorithms. The first performs unsupervised probabilistic clustering of documents; the second detects near duplicates, a task that can be handled in the offline processing of search engines. Clustering the documents avoids unnecessary comparisons, while the near-duplicate detection algorithm involves local feature selection in a given document based on weights assigned to terms. A classifier is built with supervised learning to discriminate between documents. We also propose a framework named eXtensible Near Duplicate Detection Framework (XNDDF), whose components leave room for flexible duplicate detection solutions and which shows the offline and online processing required by a search engine. Our future work is to implement the framework components in a prototype application.
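
The abstract does not spell out the weighting scheme, the feature selection rule, or the similarity measure, so the following is only a minimal sketch of the general idea of weight-based local feature selection followed by pairwise comparison. It assumes smoothed TF-IDF weights, keeps the k highest-weighted terms of each document as its local features, and compares feature sets with Jaccard similarity. The function names, the parameter k, and the threshold are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w]


def top_weighted_terms(docs, k=5):
    """Return, for each document, the set of its k highest-weighted terms.

    Assumption: weights are smoothed TF-IDF. The abstract only states that
    weights are assigned to terms, not how they are computed.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    features = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
        # Sort by descending weight, then alphabetically for determinism.
        ranked = sorted(weights, key=lambda term: (-weights[term], term))
        features.append(set(ranked[:k]))
    return features


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, k=5, threshold=0.6):
    """Return index pairs (i, j), i < j, whose feature sets are similar enough.

    Assumption: every pair is compared. The clustering step described in the
    abstract would instead restrict comparisons to documents in one cluster.
    """
    features = top_weighted_terms(docs, k)
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(features[i], features[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Calling near_duplicate_pairs on a list of document strings returns the index pairs flagged as near duplicates. In a pipeline like the one the abstract outlines, such a function would presumably run within each cluster during offline processing rather than over the whole collection.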

Related Topics
Physical Sciences and Engineering > Computer Science > Computer Science (General)