Selection of the most relevant terms based on a max-min ratio metric for text classification

Article ID	Journal	Published Year	Pages	File Type
6854663	Expert Systems with Applications	2018	61 Pages	PDF

Abstract

Text classification automatically assigns text documents to one or more predefined categories based on their content. In text classification, data are characterized by a large number of highly sparse terms and highly skewed categories. Working with all the terms in the data has an adverse impact on the accuracy and efficiency of text classification tasks. A feature selection algorithm helps in selecting the most relevant terms. In this paper, we propose a new feature ranking metric called max-min ratio (MMR). It is the product of the max-min ratio of the true positives and false positives and their difference, which allows MMR to select smaller subsets of more relevant terms even in the presence of highly skewed classes. This results in text classification with higher accuracy and greater efficiency. To investigate the effectiveness of the proposed metric, we compare its performance against eight metrics (balanced accuracy measure, information gain, chi-squared, Poisson ratio, Gini index, odds ratio, distinguishing feature selector, and normalized difference measure) on six data sets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), and 20 Newsgroups, using the multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers. The statistical significance of MMR's results has been estimated on 5 different splits of training and test data using the one-way analysis of variance (ANOVA) method and a multiple comparisons test based on the Tukey-Kramer method. We found that MMR performs statistically significantly better than the other 8 metrics in 76.2% of cases in terms of the macro F1 measure and in 74.4% of cases in terms of the micro F1 measure.
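
To make the description of the metric concrete, the following is a minimal Python sketch of one possible reading of it, not the authors' implementation. It assumes that "true positives" and "false positives" are used as rates (the fraction of positive-class and negative-class documents containing a term), that the "difference" is the absolute difference of those rates, and that a small constant eps guards against division by zero. The abstract specifies none of these details, and the function names mmr_score and rank_terms, the eps parameter, and the toy counts are all hypothetical.

```python
def mmr_score(tp, fp, n_pos, n_neg, eps=1e-6):
    """Score one term for a binary (one-vs-rest) problem.

    tp: number of positive-class documents containing the term
    fp: number of negative-class documents containing the term
    n_pos, n_neg: total positive-class and negative-class documents
    eps: assumed floor for the denominator (not taken from the abstract)
    """
    tpr = tp / n_pos  # assumed: rates rather than raw counts
    fpr = fp / n_neg
    hi, lo = max(tpr, fpr), min(tpr, fpr)
    # (max-min ratio) * (absolute difference of the two rates)
    return (hi / max(lo, eps)) * (hi - lo)


def rank_terms(counts, n_pos, n_neg, k):
    """Return the k highest-scoring terms; counts maps term -> (tp, fp)."""
    scores = {
        term: mmr_score(tp, fp, n_pos, n_neg)
        for term, (tp, fp) in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy, made-up counts for a skewed problem: 100 positive vs. 900 negative documents.
toy_counts = {
    "goal": (80, 9),    # frequent in the positive class, rare in the negative class
    "the": (95, 855),   # equally frequent in both classes
    "stock": (2, 450),  # mostly a negative-class term
    "rare": (5, 45),    # equally infrequent in both classes
}
top_terms = rank_terms(toy_counts, n_pos=100, n_neg=900, k=2)
print(top_terms)
```

Under these assumptions, a term that occurs at the same rate in both classes scores zero regardless of how frequent it is, while a term concentrated in either class scores high, which is consistent with the abstract's claim that the metric remains useful when classes are highly skewed.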

Keywords

Feature selection; Text classification