Article ID | Journal | Published Year | Pages | File Type
---|---|---|---|---
515656 | Information Processing & Management | 2012 | 20 Pages |
The automated classification of texts into predefined categories has attracted rapidly growing interest, driven by the increased availability of documents in digital form and the ensuing need to organize them. An important problem in text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Owing to category imbalance and feature sparsity in social text collections, filter methods may perform poorly. In this paper, we perform feature selection during the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary inclusion/exclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
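The search procedure described above can be sketched as a Metropolis walk over binary inclusion/exclusion vectors. This is a minimal illustration, not the paper's implementation: the `score` callable is a hypothetical stand-in for the log-posterior that the generative model assigns to a feature subset, and the single-bit flip proposal is one common choice.

```python
import math
import random

def metropolis_feature_search(score, n_features, n_steps=2000, seed=0):
    """Metropolis search over binary feature-inclusion vectors.

    `score(z)` should return the log-posterior of the feature subset
    encoded by the 0/1 vector `z` (a hypothetical stand-in for the
    generative model's posterior over subsets).
    """
    rng = random.Random(seed)
    z = [rng.randint(0, 1) for _ in range(n_features)]
    current = score(z)
    best, best_z = current, z[:]
    for _ in range(n_steps):
        j = rng.randrange(n_features)  # propose flipping one inclusion bit
        z[j] ^= 1
        proposed = score(z)
        # accept with probability min(1, exp(proposed - current))
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
            if current > best:
                best, best_z = current, z[:]
        else:
            z[j] ^= 1  # reject: undo the flip
    return best_z, best
```

With a toy score that rewards a few "relevant" features and penalizes subset size, the search quickly concentrates on the small high-scoring subset, mirroring the sharp feature-set reduction reported in the highlights.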
► Feature subsets can be scored by a posterior distribution in text classification.
► We handle feature selection by introducing a latent vector in a generative model.
► A Metropolis search is suggested to find the best feature subset automatically.
► Real-life examples illustrate the dichotomization of words in text classification.
► Our method improves classification effectiveness and sharply reduces the feature set.