کد مقاله | کد نشریه | سال انتشار | مقاله انگلیسی | نسخه تمام متن |
---|---|---|---|---|
404987 | 677469 | 2015 | 13 صفحه PDF | دانلود رایگان |
Feature selection is often used in email classification to reduce the dimensionality of the feature space. In this study, a new document frequency and term frequency combined feature selection method (DTFS) is proposed to improve the performance of email classification. Firstly, an existing optimal document frequency based feature selection method (ODFFS) and a predetermined threshold are applied to select the most discriminative features. Secondly, an existing optimal term frequency based feature selection (OTFFS) method and another predetermined threshold are applied to select more discriminative features. Finally, ODFFS and OTFFS are combined to select the remaining features. In order to improve the convergence rate of parameter optimization, a metaheuristic method, namely global best harmony oriented harmony search (GBHS), is proposed to search these optimal predetermined thresholds. Experiments with fuzzy Support Vector Machine (FSVM) and Naïve Bayesian (NB) classifiers are applied on six corpuses: PU2, CSDMC2010, PU3, Lingspam, Enron-spam and Trec2007. Experimental results show that, DTFS outperforms other methods: such as Chi-squre, comprehensively measure feature selection, t-test based feature selection, term frequency based information gain, two-step based hybrid feature selection method and improved term frequency inverse document frequency method on six corpuses.
Journal: Knowledge-Based Systems - Volume 73, January 2015, Pages 311–323