Article code: 385796 | Journal code: 660872 | Publication year: 2011 | English article: 9 pages, PDF | Full text: free download
English title of the ISI article
Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
Abstract

E-mail foldering, or the classification of e-mail into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e., the number of folders), the different number of e-mails per class state, and the fact that the problem is dynamic, in the sense that e-mails arrive in our mail folders following a timeline. Perhaps because of these problems, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem, and propose a new method based on learning and sampling probability distributions. Our experiments over a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For the sake of completeness in our experimental study, we also compare this with another standard balancing method (SMOTE) and with other classifiers.
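
The abstract does not say how the probability distributions are learned or sampled, so the following Python sketch is only one plausible reading of "distribution-based balance", not the authors' algorithm. It assumes each e-mail is a list of tokens, estimates a per-folder multinomial over terms from raw counts, and samples synthetic e-mails for the smaller folders until every folder matches the largest one. The function name `balance_by_sampling`, the `docs_by_class` layout, and the choice to draw document lengths from the observed lengths are all assumptions made for illustration.

```python
import random
from collections import Counter


def balance_by_sampling(docs_by_class, seed=0):
    """Oversample small classes with documents drawn from a learned term distribution.

    docs_by_class: dict mapping a folder label to a list of documents,
    where each document is a list of tokens (hypothetical layout).
    Assumes every class has at least one non-empty document.
    """
    rng = random.Random(seed)
    # Every class is grown to the size of the largest class.
    target = max(len(docs) for docs in docs_by_class.values())
    balanced = {}
    for label, docs in docs_by_class.items():
        # "Learning" step: per-class multinomial over terms, from raw counts.
        term_counts = Counter(tok for doc in docs for tok in doc)
        terms = sorted(term_counts)
        weights = [term_counts[t] for t in terms]
        lengths = [len(doc) for doc in docs]
        synthetic = []
        for _ in range(target - len(docs)):
            # "Sampling" step: pick an observed length, then draw that many terms.
            n = rng.choice(lengths)
            synthetic.append(rng.choices(terms, weights=weights, k=n))
        # Original documents are kept; synthetic ones are appended.
        balanced[label] = docs + synthetic
    return balanced
```

Under these assumptions, the balanced training set could then be vectorized and passed to any multinomial Naive Bayes implementation, with the test e-mails left untouched so that accuracy is still measured on the original class distribution.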

Publisher
Database: Elsevier - ScienceDirect
Journal: Expert Systems with Applications - Volume 38, Issue 3, March 2011, Pages 2072–2080