Article code: 6863745
Journal code: 1439520
Publication year: 2018
English article: 19-page PDF
Full text: free download
English title of the ISI article
Exploiting efficient and effective lazy Semi-Bayesian strategies for text classification
Persian translation of the title
بهره گیری از استراتژی نیمه بیزی و تنبل کارآمد و موثر برای طبقه بندی متن
Related topics
Engineering and Basic Sciences, Computer Engineering, Artificial Intelligence
English abstract
Automatic Document Classification (ADC) has become the basis of many important applications, such as authorship identification, opinion mining, spam filtering, and content organizers. Due to their simplicity, efficiency, absence of parameters, and effectiveness in several scenarios, Naive Bayes (NB) approaches are widely used as a classification paradigm. However, due to some characteristics of real document collections, e.g., class imbalance and feature sparseness, NB solutions do not present competitive effectiveness in some ADC tasks when compared to other supervised learning strategies, e.g., SVMs. In this article, we investigate whether a proper combination of some alternative NB learning models with different feature weighting techniques can improve NB effectiveness in ADC tasks, and we verify that results comparable or even superior to the state of the art in ADC can be achieved. Moreover, we investigate the relaxation of the NB attribute independence assumption (a.k.a. Semi-Naive approaches) in large text collections, something missing in the literature. Given the high computational costs of these investigations, we take advantage of current many-core GPU and multi-GPU architectures to perform them, presenting a massively parallelized version of the NB approach. Finally, supported by the parallel implementations, we propose four novel Lazy Semi-NB approaches to overcome potential overfitting problems. In our experiments, the new lazy solutions are not only more efficient and effective than existing Semi-NB approaches, but also surpass, in terms of effectiveness, all other alternatives in the majority of cases.
Publisher
Database: Elsevier - ScienceDirect
Journal: Neurocomputing - Volume 307, 13 September 2018, Pages 153-171