Article code: 393254
Journal code: 665586
Publication year: 2015
English article: 14-page PDF
Full-text version: Free download
English title of the ISI article
A similarity assessment technique for effective grouping of documents
Persian translation of the title
یک روش ارزیابی شباهت برای گروه بندی موثر اسناد
Keywords
Document clustering, text mining, applied data mining
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract

Document clustering is the task of grouping similar documents together while separating dissimilar ones. It is useful for finding meaningful categories in a large corpus. In practice, categorizing a corpus is difficult, since it generally contains a huge number of documents and the document vectors are high dimensional. This paper introduces a hybrid document clustering technique that combines a new hierarchical clustering technique with traditional k-means. A distance function is proposed to measure the distance between hierarchical clusters. The algorithm first constructs some clusters by hierarchical clustering using the new distance function. The k-means algorithm is then run, seeded with the centroids of the hierarchical clusters, to group the documents that are not included in those clusters. The major advantage of the proposed distance function is that it can reveal the nature of a corpus by varying a similarity threshold. Consequently, the proposed technique does not require the number of clusters to be specified before the algorithm runs, and the initial random selection of k centroids for k-means is not needed. Experimental evaluation on the Reuters, Ohsumed and various TREC data sets shows that the proposed method performs significantly better than several other document clustering techniques. F-measure and normalized mutual information are used to show that the proposed method groups the text data sets effectively.
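The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's distance function, so cosine similarity between cluster centroids is used here purely as a stand-in. The function names (`unit_rows`, `threshold_agglomerate`, `hybrid_cluster`), the parameters `threshold`, `min_size` and `n_iter`, and the choice to keep documents in the hierarchical clusters fixed during the k-means stage are all assumptions made for illustration, not the authors' method.

```python
import numpy as np


def unit_rows(x):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def threshold_agglomerate(docs, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold.

    Similarity is cosine similarity between cluster centroids: a stand-in,
    NOT the distance function proposed in the paper.
    """
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        cents = unit_rows(np.array([docs[c].mean(axis=0) for c in clusters]))
        sims = cents @ cents.T
        np.fill_diagonal(sims, -np.inf)  # ignore self-similarity
        a, b = np.unravel_index(np.argmax(sims), sims.shape)
        if sims[a, b] < threshold:
            break
        a, b = min(a, b), max(a, b)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def hybrid_cluster(x, threshold=0.8, min_size=2, n_iter=10):
    """Hierarchical stage fixes the clusters; k-means stage places the rest."""
    docs = unit_rows(np.asarray(x, dtype=float))
    clusters = threshold_agglomerate(docs, threshold)
    # Assumption: only clusters with at least min_size members count as
    # "hierarchical clusters"; their number determines k.
    core = [c for c in clusters if len(c) >= min_size]
    if not core:
        raise ValueError("no clusters formed; try a lower threshold")
    labels = np.full(len(docs), -1)
    for k, members in enumerate(core):
        labels[members] = k
    leftover = np.flatnonzero(labels == -1)
    centroids = unit_rows(np.array([docs[c].mean(axis=0) for c in core]))
    # k-means style loop seeded with the hierarchical centroids; only the
    # leftover documents are (re)assigned, core documents stay in place.
    for _ in range(n_iter):
        if leftover.size == 0:
            break
        new = np.argmax(docs[leftover] @ centroids.T, axis=1)
        if np.array_equal(new, labels[leftover]):
            break
        labels[leftover] = new
        centroids = unit_rows(
            np.array([docs[labels == k].mean(axis=0) for k in range(len(core))])
        )
    return labels
```

In this sketch the number of clusters is never passed in: it falls out of the similarity threshold, which mirrors the abstract's claim that k need not be known in advance. Raising `threshold` yields more, tighter clusters, and lowering it yields fewer, broader ones.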

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 311, 1 August 2015, Pages 149–162