Article ID | Journal | Published Year | Pages |
---|---|---|---|
10326425 | Neurocomputing | 2016 | 30 Pages |
Abstract
Feature selection is a key step in many machine learning applications, such as categorization and clustering. For text data in particular, the original document-term matrix is high-dimensional and sparse, which degrades the performance of feature selection algorithms. Meanwhile, labeling training instances is time-consuming and expensive, so unsupervised feature selection algorithms have attracted increasing attention. In this paper, we propose an unsupervised feature selection algorithm through Random Projection and Gram-Schmidt Orthogonalization (RP-GSO) from the word co-occurrence matrix. The RP-GSO algorithm has three advantages: (1) it takes a dense word co-occurrence matrix as input, avoiding the sparseness of the original document-term matrix; (2) it selects “basis features” by the Gram-Schmidt (GS) process, guaranteeing the orthogonality of the feature space; and (3) it adopts random projection to speed up the GS process. Extensive experimental results show that the proposed RP-GSO approach achieves better performance than competing supervised and unsupervised feature selection methods in text classification and clustering tasks.
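The three ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the selection criterion, so the sketch assumes a greedy rule (pick the feature whose residual norm is largest, then orthogonalize the rest against it), assumes the co-occurrence matrix is computed as XᵀX from a document-term matrix X, and assumes a Gaussian random projection. All function names and parameters are hypothetical.

```python
import math
import random


def cooccurrence(doc_term):
    """Word co-occurrence matrix, assumed here to be X^T X (rows of X = documents)."""
    n_terms = len(doc_term[0])
    return [[sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
            for i in range(n_terms)]


def random_project(rows, k, seed=0):
    """Map each row vector to k dimensions with a Gaussian random matrix (assumed scheme)."""
    rng = random.Random(seed)
    d = len(rows[0])
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [[sum(v[i] * proj[i][j] for i in range(d)) for j in range(k)] for v in rows]


def select_basis_features(vectors, n_select, tol=1e-9):
    """Greedy Gram-Schmidt selection (assumed criterion: largest residual norm first)."""
    residual = [list(v) for v in vectors]
    selected = []
    for _ in range(min(n_select, len(vectors))):
        candidates = [i for i in range(len(residual)) if i not in selected]
        norms = {i: math.sqrt(sum(x * x for x in residual[i])) for i in candidates}
        best = max(candidates, key=lambda i: norms[i])
        if norms[best] < tol:
            break  # remaining features lie in the span of those already chosen
        q = [x / norms[best] for x in residual[best]]
        selected.append(best)
        # Remove the chosen direction from every residual vector.
        for r in residual:
            dot = sum(a * b for a, b in zip(r, q))
            for j in range(len(r)):
                r[j] -= dot * q[j]
    return selected


# Toy usage: 2 documents, 3 terms; project the term vectors to 2 dimensions.
X = [[1, 1, 0], [0, 1, 1]]
C = cooccurrence(X)
basis = select_basis_features(random_project(C, k=2), n_select=2)
```

In this sketch the projection shrinks each term vector before the Gram-Schmidt loop, so the inner products are computed in k dimensions rather than in the full vocabulary size, which is the kind of speed-up the abstract attributes to random projection.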
Related Topics
Physical Sciences and Engineering
Computer Science
Artificial Intelligence
Authors
Deqing Wang, Hui Zhang, Rui Liu, Xianglong Liu, Jing Wang