Article code: 493268
Journal code: 721685
Publication year: 2012
English article: 8 pages, PDF
Full-text version: Free download
English title of the ISI article
Heuristic Frequent Term-Based Clustering of News Headlines
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science (General)
English abstract

Document clustering assigns documents to groups (called clusters) according to the general clustering rule of ‘high intra-cluster document similarity and low inter-cluster document similarity’. In this study, we propose a novel heuristic for clustering news headlines. News headlines are grammatically and semantically different from larger bodies of text, such as blog posts and reviews. Based on the heuristic, we implemented versions of the frequent term-based and frequent noun-based clustering algorithms. Both of these algorithms, along with k-means, regular frequent term clustering and regular frequent noun clustering, were evaluated using five datasets: Reuters343 and Reuters2388 (news headlines), and CICLing-2002, Hep-ex and KnCr (scientific abstracts). Interpreting the results with common external cluster quality evaluation measures (purity, entropy and F-measure), we found that the heuristic performed on par with, or even better than, traditional clustering algorithms and a few other intuitive algorithms when tested on the datasets comprising news headlines. However, on the datasets comprising scientific abstracts, the results were not favorable.
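
The three external measures named in the abstract can be illustrated with a minimal Python sketch. It uses the commonly cited textbook definitions (majority-class purity, size-weighted cluster entropy, and class-weighted best F1); the abstract does not state which exact variants the paper uses, so the formulas here are an assumption, and the function names (contingency, purity, entropy, f_measure) are illustrative only.

```python
from collections import Counter
from math import log2


def contingency(clusters, classes):
    """Map each cluster id to a Counter of the gold class labels it contains."""
    table = {}
    for cluster_id, class_label in zip(clusters, classes):
        table.setdefault(cluster_id, Counter())[class_label] += 1
    return table


def purity(clusters, classes):
    """Fraction of documents that belong to their cluster's majority class."""
    table = contingency(clusters, classes)
    return sum(max(counts.values()) for counts in table.values()) / len(classes)


def entropy(clusters, classes):
    """Size-weighted mean of per-cluster class entropy in bits (lower is better)."""
    table = contingency(clusters, classes)
    n = len(classes)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        h = -sum((k / size) * log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


def f_measure(clusters, classes):
    """Class-size-weighted mean of the best F1 each class reaches in any cluster."""
    table = contingency(clusters, classes)
    n = len(classes)
    score = 0.0
    for label, class_size in Counter(classes).items():
        best = 0.0
        for counts in table.values():
            overlap = counts.get(label, 0)
            if overlap == 0:
                continue
            precision = overlap / sum(counts.values())
            recall = overlap / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (class_size / n) * best
    return score


# Toy example (made-up labels, not data from the paper):
# cluster 0 holds two "a" and one "b"; cluster 1 holds three "b".
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, classes))
print(entropy(clusters, classes))
print(f_measure(clusters, classes))
```

Higher purity and F-measure and lower entropy indicate clusters that agree more closely with the reference classes, which is the sense in which the abstract compares the algorithms.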

Publisher
Database: Elsevier - ScienceDirect
Journal: Procedia Technology - Volume 6, 2012, Pages 436-443