Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
534282 | 870244 | 2014 | 10-page PDF | Free download |
• We prove that the frequency of a term is approximately normally distributed.
• We model the diversity of a term's frequency with the t-test.
• We verify our approach on two text corpora with three classifiers.
• Our approach is comparable to or even better than the state-of-the-art methods.
Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, and most of them are based on document frequency, such as the well-known Chi-square statistic and information gain. These document-frequency-based methods, however, have two shortcomings: (1) they are not reliable for low-frequency terms, that is, low-frequency terms are filtered out because of their smaller weights; and (2) they only count whether a term occurs within a document and ignore term frequency. In fact, a high-frequency term (other than a stop word) that occurs in few documents is often regarded as a discriminator in real-life corpora. To address these drawbacks, the paper focuses on how to construct a feature selection function based on term frequency, and proposes a new approach using Student's t-test. The t-test function is used to measure the diversity of the distributions of a term's frequency between a specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to the state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. In particular, on micro-F1 our method achieves slightly better performance on Reuters with the kNN and SVM classifiers, compared to χ² and IG.
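The idea of scoring a term by how far its frequency in one category departs from its frequency in the whole corpus can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact scoring function, so this is not that formula: it swaps in a Welch-style t statistic as a ranking score, and the function names (`term_frequencies`, `t_score`), the length-normalized frequency, and the toy documents are all assumptions made for illustration. Because the category is a subset of the corpus, the two samples overlap, so the value should be read as a ranking score rather than a strict significance test.

```python
from math import sqrt
from statistics import mean, variance


def term_frequencies(docs, term):
    """Relative frequency of `term` in each non-empty tokenized document."""
    return [doc.count(term) / len(doc) for doc in docs if doc]


def t_score(category_docs, corpus_docs, term):
    """Welch-style t statistic (an assumed stand-in, not the paper's formula)
    comparing a term's mean frequency in one category with its mean
    frequency in the entire corpus. Each sample needs at least two documents.
    """
    in_cat = term_frequencies(category_docs, term)
    in_all = term_frequencies(corpus_docs, term)
    diff = abs(mean(in_cat) - mean(in_all))
    # Standard error of the difference between the two sample means.
    se = sqrt(variance(in_cat) / len(in_cat) + variance(in_all) / len(in_all))
    if se == 0.0:
        # No variation in either sample: identical means carry no signal,
        # different means separate the samples perfectly.
        return 0.0 if diff == 0.0 else float("inf")
    return diff / se


# Hypothetical toy corpus: two "sports" documents and two "finance" documents.
sports = [["goal", "match", "goal", "team"], ["team", "goal", "win", "goal"]]
finance = [["stock", "market", "team", "bank"], ["bank", "stock", "rate", "market"]]
corpus = sports + finance

scores = {term: t_score(sports, corpus, term) for term in ("goal", "team")}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy corpus "goal" ranks above "team" for the sports category, because "team" also appears in a finance document and so its category mean lies closer to its corpus mean. A document-frequency method would see both terms in every sports document and could not separate them within that category.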
Journal: Pattern Recognition Letters - Volume 45, 1 August 2014, Pages 1–10