Article ID: 536063
Journal: Pattern Recognition Letters
Published Year: 2010
Pages: 10
File Type: PDF
Abstract

We propose a novel text classification approach based on two main concepts: lexical dependency and pruning. We extend the standard bag-of-words method by including dependency patterns in the feature vector. We perform experiments with 37 lexical dependencies and analyze the effect of each dependency type separately in order to identify the most discriminative dependencies. We also analyze the effect of pruning (filtering out features with low frequencies) for both word features and dependency features, tuning over eight different pruning levels to determine the optimal one. The experiments were repeated on three datasets with different characteristics. We observed a significant improvement in success rates as well as a reduction in the dimensionality of the feature vector. We argue that, in contrast to previous work in the literature, a much higher pruning level should be used in text classification. Analyzing the results from the dataset perspective, we also show that datasets at similar formality levels share similar leading dependencies and behave similarly as the pruning level varies.
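To make the two concepts concrete, the following is a minimal Python sketch of the general idea, not the authors' implementation: word counts are combined with dependency-pattern counts, and features whose corpus-wide frequency falls below a pruning threshold are dropped. The feature naming scheme, the toy documents, and the hand-written dependency triples are illustrative assumptions; in practice the triples would come from a dependency parser.

from collections import Counter

def extract_features(tokens, dependencies):
    """Combine bag-of-words counts with dependency-pattern counts."""
    features = Counter(tokens)                  # standard bag-of-words counts
    for rel, head, dep in dependencies:         # add one feature per dependency pattern
        features[f"{rel}({head},{dep})"] += 1
    return features

def prune(doc_features, min_freq):
    """Drop features whose total frequency across all documents is below min_freq."""
    totals = Counter()
    for feats in doc_features:
        totals.update(feats)
    kept = {f for f, c in totals.items() if c >= min_freq}
    return [Counter({f: c for f, c in feats.items() if f in kept})
            for feats in doc_features]

# Toy usage: two tiny "documents" with hand-written (relation, head, dependent) triples.
docs = [
    (["stocks", "rise", "sharply"],
     [("nsubj", "rise", "stocks"), ("advmod", "rise", "sharply")]),
    (["stocks", "fall"],
     [("nsubj", "fall", "stocks")]),
]
features = [extract_features(toks, deps) for toks, deps in docs]
pruned = prune(features, min_freq=2)
print(pruned)   # only features seen at least twice survive, e.g. the word "stocks"

Raising min_freq shrinks the feature vector at the cost of discarding rarer word and dependency features, which is the trade-off the pruning-level experiments explore.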

Related Topics
Physical Sciences and Engineering; Computer Science; Computer Vision and Pattern Recognition