Article code: 515678
Journal code: 867069
Publication year: 2011
English article: 13-page PDF
Full-text version: free download
English title of the ISI article
Exploiting probabilistic topic models to improve text categorization under class imbalance
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
English abstract

In text categorization, the numbers of documents in different categories often differ, i.e., the class distribution is imbalanced. We propose a unique approach to improve text categorization under class imbalance by exploiting the semantic context in text documents. Specifically, we generate new samples of rare classes (categories with relatively small amounts of training data) by using the global semantic information of classes represented by probabilistic topic models. In this way, the numbers of samples in different categories become more balanced, and the performance of text categorization can be improved using this transformed data set. The proposed method differs from traditional re-sampling methods, which try to balance the number of documents in different classes by re-sampling the documents in rare classes. Such re-sampling methods can cause overfitting. Another benefit of our approach is the effective handling of noisy samples: since all the new samples are generated by topic models, the impact of noisy samples is dramatically reduced. Finally, as demonstrated by the experimental results, the proposed methods achieve better performance under class imbalance and are more tolerant to noisy samples.
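
The idea of generating rare-class samples from a topic model can be illustrated with a minimal sketch. It is not the paper's DECOM or DECODER procedure, whose details are not given here. It assumes an LDA-style generative process and assumes that per-class topic-word distributions have already been fitted; the toy distributions, the class labels, and the names `sample_dirichlet`, `generate_document`, and `balance_classes` are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    # Draw from Dirichlet(alpha) by normalizing independent Gamma draws.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    # Assumed LDA-style process: draw a topic mixture for the document,
    # then draw a topic and a word from that topic for each position.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = range(len(topic_word))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta)[0]
        vocab = list(topic_word[z].keys())
        probs = list(topic_word[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words


def balance_classes(class_sizes, class_models, alpha, doc_length, seed=0):
    # Generate synthetic documents for every class that is smaller than
    # the largest class, so that all classes reach the same size.
    rng = random.Random(seed)
    target = max(class_sizes.values())
    synthetic = {}
    for label, size in class_sizes.items():
        model = class_models[label]
        synthetic[label] = [
            generate_document(model, alpha, doc_length, rng)
            for _ in range(target - size)
        ]
    return synthetic


# Hypothetical toy input: two topics per class, hand-written distributions.
class_sizes = {"sports": 5, "finance": 2}
class_models = {
    "sports": [{"goal": 0.5, "match": 0.5}, {"team": 0.6, "coach": 0.4}],
    "finance": [{"stock": 0.6, "bond": 0.4}, {"bank": 0.7, "loan": 0.3}],
}
synthetic = balance_classes(class_sizes, class_models, alpha=[0.5, 0.5], doc_length=8)
```

In this sketch the rare class "finance" receives three synthetic documents and the majority class receives none. Because the new documents are drawn from class-level distributions rather than copied from individual training documents, no single document is duplicated, which is the property the abstract contrasts with traditional re-sampling.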

Research highlights
► Propose two re-sampling methods based on probabilistic topic models.
► Improve text categorization under class imbalance.
► DECOM and DECODER achieve better performance under class imbalance.
► DECODER is more tolerant to noisy samples.

Publisher
Database: Elsevier - ScienceDirect
Journal: Information Processing & Management - Volume 47, Issue 2, March 2011, Pages 202–214