Heuristic Frequent Term-Based Clustering of News Headlines

Article ID	Journal	Published Year	Pages	File Type
493268	Procedia Technology	2012	8 Pages	PDF

Abstract

Document clustering assigns documents to groups (called clusters) in accordance with the general clustering rule, ‘high intra-cluster document similarity and low inter-cluster document similarity’. In this study, we propose novel heuristics for clustering news headlines. News headlines are grammatically and semantically different from larger bodies of text, such as blog posts and reviews. Based on the heuristics, we implemented versions of the frequent term-based and frequent noun-based clustering algorithms. Both algorithms, along with k-means and the regular frequent term-based and frequent noun-based clustering algorithms, were evaluated using five datasets: Reuters343 and Reuters2388 (news headlines), and CICLing-2002, Hep-ex and KnCr (scientific abstracts). Interpreting the results with common external cluster quality evaluation measures (purity, entropy and F-measure), we found that the heuristics performed on par with, or even better than, traditional clustering algorithms and a few other intuitive algorithms when tested on the datasets comprising news headlines. On the datasets comprising scientific abstracts, however, the results were not favorable.
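
The three external measures named in the abstract can be illustrated with a minimal Python sketch. It uses the common textbook definitions, which are assumed here and may differ in detail from the formulas used in the paper (for example, entropy is left unnormalized and measured in bits, and the F-measure is the class-weighted best-match variant). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def contingency(labels_true, labels_pred):
    """Map each cluster id to a Counter of the true classes it contains."""
    table = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        table.setdefault(cluster_id, Counter())[true_label] += 1
    return table


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    table = contingency(labels_true, labels_pred)
    return sum(max(counts.values()) for counts in table.values()) / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted average of per-cluster class entropy in bits (lower is better)."""
    table = contingency(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for counts in table.values():
        size = sum(counts.values())
        cluster_entropy = -sum((c / size) * log2(c / size) for c in counts.values())
        total += (size / n) * cluster_entropy
    return total


def f_measure(labels_true, labels_pred):
    """Class-weighted F-measure: each class is matched to its best-scoring cluster."""
    table = contingency(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for cls, cls_size in Counter(labels_true).items():
        best = 0.0
        for counts in table.values():
            overlap = counts.get(cls, 0)
            if overlap == 0:
                continue
            precision = overlap / sum(counts.values())
            recall = overlap / cls_size
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (cls_size / n) * best
    return total


if __name__ == "__main__":
    # Hypothetical gold categories and cluster assignments for five headlines.
    gold = ["sports", "sports", "politics", "politics", "politics"]
    clusters = [0, 0, 0, 1, 1]
    print("purity   :", purity(gold, clusters))
    print("entropy  :", entropy(gold, clusters))
    print("F-measure:", f_measure(gold, clusters))
```

Higher purity and F-measure and lower entropy indicate clusters that agree more closely with the gold categories, which is the sense in which the abstract compares the algorithms.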