Article ID: 6854663
Journal: Expert Systems with Applications
Published Year: 2018
Pages: 61
File Type: PDF
Abstract
Text classification automatically assigns text documents to one or more predefined categories based on their content. In text classification, data are characterized by a large number of highly sparse terms and highly skewed categories. Working with all the terms in the data has an adverse impact on the accuracy and efficiency of text classification tasks. A feature selection algorithm helps by selecting the most relevant terms. In this paper, we propose a new feature ranking metric called max-min ratio (MMR). It is the product of the max-min ratio of the true positives and false positives and their difference, which allows MMR to select smaller subsets of more relevant terms even in the presence of highly skewed classes. This results in text classification with higher accuracy and greater efficiency. To investigate the effectiveness of the proposed metric, we compare its performance against eight metrics (balanced accuracy measure, information gain, chi-squared, Poisson ratio, Gini index, odds ratio, distinguishing feature selector, and normalized difference measure) on six data sets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), and 20 Newsgroups, using the multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers. The statistical significance of MMR's performance was estimated on five different splits of the training and test data sets using one-way analysis of variance (ANOVA) and a multiple comparisons test based on the Tukey-Kramer method. We found that MMR performs statistically significantly better than the other eight metrics in 76.2% of cases in terms of the macro F1 measure and in 74.4% of cases in terms of the micro F1 measure.
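As an illustration of the metric described in the abstract, the following is a minimal Python sketch of a max-min ratio score for binary term ranking. It is not the authors' implementation. Several details are assumptions that the abstract does not state: the score is computed on rates (true positive rate and false positive rate) rather than raw counts, the difference is taken as an absolute value, and a small constant `eps` guards against division by zero. The names `mmr_score` and `rank_terms` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def mmr_score(tp: int, fp: int, n_pos: int, n_neg: int, eps: float = 1e-6) -> float:
    """Max-min ratio of the two rates multiplied by their absolute difference.

    tp: positive-class documents containing the term
    fp: negative-class documents containing the term
    Assumption: rates are used instead of counts, and eps floors the denominator.
    """
    tpr = tp / n_pos if n_pos else 0.0
    fpr = fp / n_neg if n_neg else 0.0
    hi, lo = max(tpr, fpr), min(tpr, fpr)
    # hi - lo equals |tpr - fpr|; the ratio rewards terms concentrated in one class.
    return (hi / max(lo, eps)) * (hi - lo)


def rank_terms(docs: Sequence[Set[str]], labels: Sequence[int]) -> List[Tuple[str, float]]:
    """Rank terms by score, highest first; ties are broken alphabetically."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    tp: Dict[str, int] = {}
    fp: Dict[str, int] = {}
    for terms, y in zip(docs, labels):
        counts = tp if y == 1 else fp
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    vocab = set(tp) | set(fp)
    scores = {t: mmr_score(tp.get(t, 0), fp.get(t, 0), n_pos, n_neg) for t in vocab}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Under these assumptions, a term that appears equally often in both classes scores zero, while a term that appears in only one class receives a very large score bounded by `1 / eps`. Keeping the top-k entries returned by `rank_terms` gives the selected feature subset for a one-vs-rest classifier.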
Related Topics
Physical Sciences and Engineering > Computer Science > Artificial Intelligence