Article code: 515656
Journal code: 867059
Publication year: 2012
English article: 20 pages, PDF
Full-text version: Free download
English title of the ISI article
A Bayesian feature selection paradigm for text classification
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Applications
English abstract

The automated classification of texts into predefined categories has attracted growing interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem in text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Because of category imbalance and feature sparsity in social text collections, filter methods may work poorly. In this paper, we perform feature selection during the training process, automatically selecting the best feature subset by learning the characteristics of the categories from a set of preclassified documents. We propose a generative probabilistic model that describes categories by distributions and handles the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach.


► Feature subsets can be scored by a posterior distribution in text classification.
► We handle feature selection by introducing a latent vector in a generative model.
► A Metropolis search is suggested to find the best feature subset automatically.
► Real-life examples illustrate the dichotomization of words in text classification.
► Our method improves classification effectively and sharply reduces the feature set.
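
The idea of scoring feature subsets with a posterior and searching over a binary inclusion vector can be illustrated with a small sketch. The listing does not give the paper's actual model, so everything below is an assumption made for illustration: each word is modeled by document presence (Bernoulli) with a Beta(1, 1) prior, an included word gets one presence rate per category, an excluded word shares a single rate across categories, and each word is included a priori with probability `prior_inclusion`. The function names (`log_beta_binomial`, `feature_log_score`, `metropolis_search`) are hypothetical. The search itself is a plain Metropolis step with a symmetric single-flip proposal.

```python
import math
import random


def log_beta_binomial(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k successes in n Bernoulli trials
    under a Beta(a, b) prior (ordered sequence, no binomial coefficient)."""
    return (math.lgamma(a + k) + math.lgamma(b + n - k) - math.lgamma(a + b + n)
            + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))


def feature_log_score(doc_counts, class_sizes, included, prior_inclusion=0.2):
    """Unnormalized log posterior contribution of one word.

    doc_counts[c] is the number of documents of category c containing the word.
    Assumed model: included words have category-specific presence rates,
    excluded words share one rate across all categories.
    """
    if included:
        ll = sum(log_beta_binomial(k, n) for k, n in zip(doc_counts, class_sizes))
        return ll + math.log(prior_inclusion)
    ll = log_beta_binomial(sum(doc_counts), sum(class_sizes))
    return ll + math.log(1.0 - prior_inclusion)


def metropolis_search(counts, class_sizes, n_iter=5000, seed=0):
    """Metropolis search over the binary inclusion vector gamma.

    counts[j][c] is the number of documents of category c containing word j.
    Returns the highest-scoring gamma visited and its unnormalized log score.
    """
    rng = random.Random(seed)
    p = len(counts)
    gamma = [False] * p  # start with every word excluded
    score = sum(feature_log_score(counts[j], class_sizes, gamma[j]) for j in range(p))
    best, best_score = list(gamma), score
    for _ in range(n_iter):
        j = rng.randrange(p)  # symmetric proposal: flip one random coordinate
        delta = (feature_log_score(counts[j], class_sizes, not gamma[j])
                 - feature_log_score(counts[j], class_sizes, gamma[j]))
        # Accept with probability min(1, exp(delta)).
        if delta >= 0 or rng.random() < math.exp(delta):
            gamma[j] = not gamma[j]
            score += delta
            if score > best_score:
                best, best_score = list(gamma), score
    return best, best_score
```

Calling `metropolis_search(counts, class_sizes)` on a word-by-category table of document counts returns a Boolean vector marking the words kept. Under this simplified independence assumption the score factorizes over words, so the search is more than strictly needed; it is kept here only to show the mechanics of a flip-and-accept update on the latent vector.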

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 48, Issue 2, March 2012, Pages 283–302