Article ID: 865567
Journal: Tsinghua Science & Technology
Published Year: 2009
Pages: 8
File Type: PDF
Abstract
Chinese text categorization differs from English text categorization in its much larger term set (of words or character n-grams), which makes modern high-performance classifiers very slow to train and to apply. This study assumes that this high-dimensionality problem is related to redundancy in the term set, a problem that traditional term selection methods cannot solve. A greedy algorithm framework named “non-independent term selection” is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between classification performance and the size of the term set.
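The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a greedy, redundancy-aware term selection could look like, not the authors' implementation. It assumes that "string-level correlation" can be approximated by substring containment combined with near-identical document frequency; the function names (`char_ngrams`, `document_frequency`, `select_terms`) and the `overlap` threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def char_ngrams(text, n_max=2):
    """Return the set of character n-grams (n = 1..n_max) found in a text."""
    return {text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)}


def document_frequency(docs, n_max=2):
    """Count, for each n-gram, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(char_ngrams(doc, n_max))
    return df


def select_terms(df, overlap=0.9):
    """Greedy selection (illustrative assumption, not the paper's method).

    Terms are visited by descending document frequency, longer terms first
    on ties. A term is skipped as redundant when an already selected term
    contains it (or is contained in it) and the two document frequencies
    are nearly equal, i.e. their min/max ratio is at least `overlap`.
    """
    selected = []
    for term in sorted(df, key=lambda t: (-df[t], -len(t), t)):
        redundant = any(
            (term in kept or kept in term)
            and min(df[term], df[kept]) / max(df[term], df[kept]) >= overlap
            for kept in selected
        )
        if not redundant:
            selected.append(term)
    return selected


docs = ["清华大学", "清华大学", "大学生"]
df = document_frequency(docs)
terms = select_terms(df)
print(len(df), "candidate terms reduced to", len(terms))
```

On this toy corpus the single characters are dropped because each one occurs in exactly the same documents as a selected bigram that contains it, which is the kind of redundancy the abstract attributes to large Chinese term sets. A real system would replace document frequency with a class-aware score and would need a more efficient containment check than the quadratic loop used here.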