Article ID Journal Published Year Pages File Type
534282 Pattern Recognition Letters 2014 10 Pages PDF
Abstract

• We prove the frequency distribution of a term is approximately normally distributed.
• We model the diversity of the frequency of a term with a t-test.
• We verify our approach on two text corpora with three classifiers.
• Our approach is comparable to or even better than the state-of-the-art methods.

Feature selection techniques play an important role in text categorization (TC), especially for large-scale TC tasks. Many new and improved methods have been proposed, most of them based on document frequency, such as the well-known Chi-square statistic and information gain. These document-frequency-based methods, however, have two shortcomings: (1) they are not reliable for low-frequency terms, i.e., low-frequency terms are filtered out because of their smaller weights; and (2) they only count whether a term occurs in a document and ignore term frequency. In practice, a high-frequency term (other than a stop word) that occurs in few documents often serves as a discriminator in a real-life corpus. To address these drawbacks, this paper focuses on constructing a feature selection function based on term frequency, and proposes a new approach using Student's t-test. The t-test function measures the diversity of the distributions of a term's frequency between a specific category and the entire corpus. Extensive comparative experiments on two text corpora using three classifiers show that the proposed approach is comparable to the state-of-the-art feature selection methods in terms of macro-F1 and micro-F1. On micro-F1 in particular, our method achieves slightly better performance on Reuters with kNN and SVM classifiers, compared to χ2 and IG.
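The core idea (scoring a term by how differently its per-document frequency is distributed in one category versus the whole corpus) can be sketched with a standard two-sample t-statistic. This is a minimal illustration of the general technique, not the paper's exact formulation; the function name `t_score` and the toy inputs are hypothetical.

```python
import math

def t_score(tf_in_category, tf_in_corpus):
    """Two-sample t-statistic (Welch-style) comparing a term's per-document
    frequencies within one category against the entire corpus. A larger
    absolute value suggests the term discriminates that category better.
    Hypothetical sketch, not the paper's exact scoring function."""
    n1, n2 = len(tf_in_category), len(tf_in_corpus)
    m1 = sum(tf_in_category) / n1
    m2 = sum(tf_in_corpus) / n2
    # Unbiased sample variances of the two frequency samples.
    v1 = sum((x - m1) ** 2 for x in tf_in_category) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in tf_in_corpus) / (n2 - 1)
    return abs(m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

# Toy data: a term frequent in the category but rare elsewhere scores high;
# terms would then be ranked by this score and the top-k kept as features.
score = t_score([5, 6, 5, 7], [1, 2, 1, 2, 5, 6, 5, 7])
```

Because the score uses actual term frequencies rather than binary document occurrence, a high-frequency term concentrated in few documents is not penalized the way it is under document-frequency-based measures such as χ2 or IG.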
