Article code: 396391
Journal code: 666425
Publication year: 2006
English article: 27 pages (PDF)
English title of the ISI article
Class normalization in centroid-based text categorization
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
Abstract

Centroid-based categorization is one of the most popular algorithms in text classification. In this approach, normalization is an important factor in improving the performance of a centroid-based classifier when documents in a text collection differ widely in size and/or the numbers of documents per class are unbalanced. In the past, most researchers applied document normalization, e.g., document-length normalization, while some considered a simple kind of class normalization, so-called class-length normalization, to address the imbalance problem. However, no in-depth work has clarified how these normalizations affect classification performance or whether other useful normalizations exist. The purpose of this paper is threefold: (1) to investigate the effectiveness of document- and class-length normalization on several data sets, (2) to evaluate a number of commonly used normalization functions, and (3) to introduce a new type of class normalization, called term-length normalization, which exploits the term distribution among documents in the class. The experimental results show that a classifier using the weight–merge–normalize approach (class-length normalization) performs better than one using the weight–normalize–merge approach (document-length normalization) on data sets with unbalanced numbers of documents per class, and is quite competitive on those with balanced numbers of documents. Among normalization functions, normalization based on term weighting performs better than the others on average. Term-length normalization is useful for improving classification accuracy. The combination of term- and class-length normalization outperforms pure class-length normalization, pure term-length normalization, and no normalization by margins of 4.29%, 11.50%, and 30.09%, respectively.
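
The two orderings named in the abstract can be illustrated with a small sketch. This is only one plausible reading, not the paper's implementation: it assumes documents are already term-weighted sparse vectors, that "normalize" means scaling to unit Euclidean length, that "merge" means a term-by-term sum, and that classification uses a dot product against a unit-length query. The function names and the toy data are hypothetical and do not come from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a sparse vector (dict: term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {t: w / norm for t, w in vec.items()}

def merge(vectors):
    """Sum sparse vectors term by term (assumed meaning of 'merge')."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return total

def weight_normalize_merge(doc_vectors):
    """Document-length normalization: normalize each document, then merge."""
    return merge(l2_normalize(v) for v in doc_vectors)

def weight_merge_normalize(doc_vectors):
    """Class-length normalization: merge the documents, then normalize the class vector."""
    return l2_normalize(merge(doc_vectors))

def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def classify(query, centroids):
    """Return the class whose centroid has the largest dot product with the query."""
    q = l2_normalize(query)
    return max(centroids, key=lambda c: dot(q, centroids[c]))

# Hypothetical, already-weighted toy data with unbalanced classes (3 documents vs 1).
classes = {
    "sports": [{"game": 2, "team": 1}, {"game": 1, "score": 1}, {"game": 1, "team": 2}],
    "politics": [{"vote": 2, "party": 1}],
}
query = {"vote": 1, "game": 1}

wnm_centroids = {c: weight_normalize_merge(docs) for c, docs in classes.items()}
wmn_centroids = {c: weight_merge_normalize(docs) for c, docs in classes.items()}

print(classify(query, wnm_centroids))  # sports
print(classify(query, wmn_centroids))  # politics
```

In this toy setting the unnormalized sum of three unit-length documents gives the larger class a longer centroid, so it wins on a query that is equally related to both classes; normalizing the merged class vector removes that size effect. The example only illustrates the mechanism and says nothing about the accuracy figures reported in the abstract.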

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Sciences - Volume 176, Issue 12, 22 June 2006, Pages 1712–1738