Article ID | Journal | Published Year | Pages | File Type
---|---|---|---|---
515656 | Information Processing & Management | 2012 | 20 Pages |
The automated classification of texts into predefined categories has attracted rapidly growing interest, driven by the increased availability of documents in digital form and the ensuing need to organize them. An important problem in text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Owing to category imbalance and feature sparsity in social text collections, filter methods may perform poorly. In this paper, we perform feature selection during the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary inclusion/exclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.
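The search procedure described above can be sketched as a Metropolis walk over binary inclusion/exclusion vectors. This is a minimal illustration, not the paper's implementation: the `score` callable is a hypothetical stand-in for the log-posterior that the generative model assigns to a feature subset, and the single-bit flip proposal is one common choice.

```python
import math
import random

def metropolis_feature_search(score, n_features, n_steps=2000, seed=0):
    """Metropolis search over binary feature-inclusion vectors.

    `score(z)` should return the log-posterior of the feature subset
    encoded by the 0/1 vector `z` (a hypothetical stand-in for the
    generative model's posterior over subsets).
    """
    rng = random.Random(seed)
    z = [rng.randint(0, 1) for _ in range(n_features)]
    current = score(z)
    best, best_z = current, z[:]
    for _ in range(n_steps):
        j = rng.randrange(n_features)  # propose flipping one inclusion bit
        z[j] ^= 1
        proposed = score(z)
        # accept with probability min(1, exp(proposed - current))
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
            if current > best:
                best, best_z = current, z[:]
        else:
            z[j] ^= 1  # reject: undo the flip
    return best_z, best
```

With a toy score that rewards a few "relevant" features and penalizes subset size, the search quickly concentrates on the small high-scoring subset, mirroring the sharp feature-set reduction reported in the highlights.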
► Feature subsets can be scored by a posterior distribution in text classification.
► We handle feature selection by introducing a latent vector in a generative model.
► A Metropolis search is suggested to find the best feature subset automatically.
► Real-life examples illustrate the dichotomization of words in text classification.
► Our method improves classification effectiveness and sharply reduces the feature set.