Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
865567 | Tsinghua Science & Technology | 2009 | 8 Pages |
Abstract
Chinese text categorization differs from English text categorization in its much larger term set (of words or character n-grams), which makes training and running modern high-performance classifiers very slow. This study hypothesizes that this high-dimensionality problem stems from redundancy in the term set, which traditional term selection methods cannot remove. A greedy algorithm framework named “non-independent term selection” is presented, which reduces the redundancy according to string-level correlations. Several preliminary implementations of this idea are demonstrated. Experimental results show that a good tradeoff can be reached between performance and the size of the term set.
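
The abstract does not spell out the selection procedure, so the following is only a minimal sketch of what a greedy, redundancy-aware term selector could look like. It is not the authors' algorithm. The function name `select_terms`, the `penalty` parameter, the substring test used as a stand-in for "string-level correlation", and the example scores are all assumptions made for illustration.

```python
from typing import Dict, List


def select_terms(scores: Dict[str, float], max_terms: int,
                 penalty: float = 0.5) -> List[str]:
    """Greedily pick terms, discounting string-level overlaps.

    Hypothetical rule (an assumption, not the paper's): once a term is
    selected, every remaining term that contains it, or is contained in
    it, has its score multiplied by `penalty`.
    """
    selected: List[str] = []
    remaining = dict(scores)  # work on a copy so the input is untouched
    while remaining and len(selected) < max_terms:
        # Pick the best remaining term under the current discounted scores.
        best = max(remaining, key=lambda t: (remaining[t], t))
        selected.append(best)
        del remaining[best]
        # Discount terms that overlap the pick at the string level.
        for term in remaining:
            if term in best or best in term:
                remaining[term] *= penalty
    return selected


# Made-up scores for illustration only.
example_scores = {"清华大学": 0.9, "清华": 0.8, "大学": 0.7, "足球": 0.6}
print(select_terms(example_scores, 2))
print(select_terms(example_scores, 2, penalty=1.0))
```

With `penalty=1.0` the overlap discount is disabled and the procedure reduces to ordinary independent top-k selection by score, which makes the effect of the redundancy rule easy to compare on the same input.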
Related Topics
Physical Sciences and Engineering
Engineering
Engineering (General)
Authors
Li (李景阳), Sun (孙茂松)