On retrieving intelligently plagiarized documents using semantic similarity

Article ID	Journal	Published Year	Pages	File Type
380380	Engineering Applications of Artificial Intelligence	2015	13 Pages	PDF

Abstract

• A novel semantic similarity measure for retrieving potentially plagiarized documents.
• An architecture for fast retrieval of plagiarized documents.
• A dataset built for plagiarism detection with intelligently paraphrased content.
• A comparison of our results with the plagiarism detection software Turnitin and with search engines.

Plagiarism in text documents can take many forms. The most common is to copy a chunk of text and alter it intelligently so that it looks original. Such cases are hard to detect because they require semantic analysis of the document. External sources of knowledge such as WordNet have been employed to help detect them. However, such an approach can miss the contextual significance of the words used, and it suffers from the problems of synonymy and polysemy. We propose an architecture built on a semantic similarity measure that exploits the semantic similarity of words as mined from within the data corpus, thereby using localized contextual information. The approach detects plagiarism in text documents by using this semantic similarity measure in a Nearest Neighbor (NN) search and as a kernel in a multiclass support vector machine. We test the approach on a plagiarism dataset developed specifically to evaluate the solution at varying levels of plagiarism. The results are compared with those of a well-known commercial tool, Turnitin®, which has access to a large database. Our experiments suggest that semantic kernels can help detect plagiarism that outsmarts available techniques.
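To make the idea of a corpus-mined semantic similarity concrete, here is a minimal sketch. It is not the authors' measure, which the abstract does not specify. It assumes that two words are related when they occur in the same corpus documents (cosine similarity of their document-count vectors), and it uses that word similarity to compare documents and to run a brute-force nearest neighbor search. The function names, the whitespace tokenizer, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the preprocessing is not given)."""
    return text.lower().split()


def build_term_similarity(corpus):
    """Mine word-to-word similarity from the corpus itself.

    Each word is represented by its counts across the corpus documents, and
    two words are compared by the cosine of those count vectors. This is an
    assumed stand-in for the paper's semantic similarity measure.
    """
    counts = [Counter(tokenize(doc)) for doc in corpus]
    vocab = sorted({word for c in counts for word in c})
    vectors = {w: [c[w] for c in counts] for w in vocab}
    norms = {w: math.sqrt(sum(x * x for x in v)) for w, v in vectors.items()}
    sim = {}
    for a in vocab:
        for b in vocab:
            dot = sum(x * y for x, y in zip(vectors[a], vectors[b]))
            sim[(a, b)] = dot / (norms[a] * norms[b])
    return sim


def semantic_kernel(doc_a, doc_b, sim):
    """Sum word-pair similarities, weighted by term frequency in each document."""
    freq_a, freq_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    return sum(
        fa * fb * sim.get((wa, wb), 0.0)  # unseen words contribute nothing
        for wa, fa in freq_a.items()
        for wb, fb in freq_b.items()
    )


def semantic_similarity(doc_a, doc_b, sim):
    """Kernel value normalized to [0, 1], analogous to cosine similarity."""
    k_aa = semantic_kernel(doc_a, doc_a, sim)
    k_bb = semantic_kernel(doc_b, doc_b, sim)
    if k_aa == 0 or k_bb == 0:
        return 0.0
    return semantic_kernel(doc_a, doc_b, sim) / math.sqrt(k_aa * k_bb)


def nearest_sources(query, sources, sim, k=1):
    """Brute-force nearest neighbor search: the k best (score, index) pairs."""
    scored = [(semantic_similarity(query, s, sim), i) for i, s in enumerate(sources)]
    return sorted(scored, reverse=True)[:k]


# Toy corpus: "car" and "automobile" co-occur, so they become related.
corpus = ["car automobile engine", "cat kitten sofa", "car engine", "cat sofa"]
sim = build_term_similarity(corpus)
best = nearest_sources("automobile", ["cat sofa", "car engine"], sim, k=1)
```

In this toy setup the query "automobile" shares no word with "car engine", yet that source ranks first because the two words co-occur elsewhere in the corpus. The same kernel values could also be arranged into a Gram matrix for a classifier that accepts precomputed kernels (scikit-learn's `SVC(kernel="precomputed")` is one such option), although the abstract does not say how the authors' multiclass SVM was configured.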

Keywords

Information retrieval; Plagiarism detection; Semantic similarity; Support vector machine