Exploiting efficient and effective lazy Semi-Bayesian strategies for text classification

Article ID	Journal	Published Year	Pages	File Type
6863745	Neurocomputing	2018	19 Pages	PDF

Abstract

Automatic Document Classification (ADC) has become the basis of many important applications, e.g., authorship identification, opinion mining, spam filtering, and content organizers. Owing to their simplicity, efficiency, absence of parameters, and effectiveness in several scenarios, Naive Bayes (NB) approaches are widely used as a classification paradigm. However, because of some characteristics of real document collections, e.g., class imbalance and feature sparseness, NB solutions are not competitive in effectiveness in some ADC tasks when compared to other supervised learning strategies, e.g., SVMs. In this article, we investigate whether a proper combination of alternative NB learning models with different feature weighting techniques can improve NB effectiveness in ADC tasks, and we verify that results comparable or even superior to the state of the art in ADC can be achieved. Moreover, we present an investigation of the relaxation of the NB attribute independence assumption (i.e., Semi-Naive approaches) in large text collections, something missing in the literature. Given the high computational costs of these investigations, we take advantage of current many-core GPU and multi-GPU architectures to perform them, presenting a massively parallelized version of the NB approach. Finally, supported by the parallel implementations, we propose four novel Lazy Semi-NB approaches to overcome potential overfitting problems. In our experiments, the new lazy solutions are not only more efficient and effective than existing Semi-NB approaches, but also surpass, in terms of effectiveness, all other alternatives in the majority of the cases.
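
For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of a standard multinomial Naive Bayes text classifier with add-alpha smoothing. It is a textbook formulation, not the article's GPU implementation and not one of its Semi-NB or lazy variants. The function names (`train_nb`, `predict_nb`), the `alpha` parameter, and the toy data are assumptions introduced here for illustration. The attribute independence assumption shows up in the scoring loop, where per-term log-likelihoods are simply summed.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial NB model; docs are token lists (hypothetical helper)."""
    class_docs = Counter(labels)          # number of documents per class
    term_counts = defaultdict(Counter)    # term frequencies per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    model = {}
    for c in class_docs:
        # Denominator of the smoothed estimate: class term total + alpha * |V|
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        model[c] = {
            "log_prior": math.log(class_docs[c] / n_docs),
            "log_like": {
                t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
            },
        }
    return model


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for c, params in model.items():
        score = params["log_prior"]
        for t in tokens:
            # Independence assumption: each term contributes on its own.
            # Terms never seen in training are skipped.
            if t in params["log_like"]:
                score += params["log_like"][t]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
toy_docs = [
    ["cheap", "pills", "buy"],
    ["meeting", "agenda", "notes"],
    ["buy", "cheap", "now"],
    ["project", "meeting", "notes"],
]
toy_labels = ["spam", "ham", "spam", "ham"]
toy_model = train_nb(toy_docs, toy_labels)
```

Calling `predict_nb(toy_model, ["cheap", "buy"])` scores each class by summing independent per-term contributions. A Semi-Naive approach would relax this step by letting some terms depend on others.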

Keywords

GPU parallelization; Naïve Bayes classifier; Text classification