Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
6940237 | Pattern Recognition Letters | 2018 | 7 Pages |
Abstract
Term weighting is known as a text presentation strategy to assign appropriate value to each term to improve the performance of text classification in the task of transforming the content of textual document into a vector in the term space. Supervised weighting methods using the information on the membership of training documents in predefined classes are naturally expected to provide better results than the unsupervised ones. In this paper, a new weighting scheme is proposed via a matching score function based on a probabilistic model. We introduce a latent variable to indicate whether a term contains text classification information or not, specify conjugate priors and exploit the conjugacy by integrating out the latent indicator and the parameters. Then the non-discriminating terms can be assigned weights close to 0. Experimental results using kNN and SVM classifiers illustrate the effectiveness of the proposed approach on both small and large text data sets.
Keywords
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Vision and Pattern Recognition
Authors
Guozhong Feng, Shaoting Li, Tieli Sun, Bangzuo Zhang,