Unsupervised feature selection through Gram-Schmidt orthogonalization: A word co-occurrence perspective

Article ID	Journal	Published Year	Pages	File Type
10326425	Neurocomputing	2016	30 Pages	PDF

Abstract

Feature selection is a key step in many machine learning applications, such as categorization and clustering. For text data in particular, the original document-term matrix is high-dimensional and sparse, which can degrade the performance of feature selection algorithms. Meanwhile, labeling training instances is time-consuming and expensive, so unsupervised feature selection algorithms have attracted increasing attention. In this paper, we propose an unsupervised feature selection algorithm based on Random Projection and Gram-Schmidt Orthogonalization (RP-GSO), applied to the word co-occurrence matrix. The RP-GSO algorithm has three advantages: (1) it takes a dense word co-occurrence matrix as input, avoiding the sparseness of the original document-term matrix; (2) it selects “basis features” by the Gram-Schmidt process, guaranteeing the orthogonality of the resulting feature space; and (3) it adopts random projection to speed up the Gram-Schmidt (GS) process. Extensive experimental results show that the proposed RP-GSO approach achieves better performance than the supervised and unsupervised feature selection methods it is compared against in text classification and clustering tasks.
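The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the co-occurrence matrix is built, which random projection is used, or how basis features are ranked, so everything below is an assumption made for illustration: co-occurrence is taken as X^T X of the document-term matrix X, the projection is a Gaussian random matrix, and selection is a greedy Gram-Schmidt that repeatedly picks the feature with the largest residual norm. The function names `cooccurrence`, `random_project`, and `gram_schmidt_select` are hypothetical and are not taken from the paper.

```python
import random


def cooccurrence(doc_term):
    """Assumed construction: C = X^T X for a document-term matrix X (list of rows)."""
    n_terms = len(doc_term[0])
    C = [[0.0] * n_terms for _ in range(n_terms)]
    for row in doc_term:
        nonzero = [(j, v) for j, v in enumerate(row) if v]
        for i, vi in nonzero:
            for j, vj in nonzero:
                C[i][j] += vi * vj
    return C


def random_project(rows, k, seed=0):
    """Assumed projection: map each row to k dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    R = [[rng.gauss(0.0, 1.0) / k ** 0.5 for _ in range(k)] for _ in range(d)]
    return [[sum(r[i] * R[i][j] for i in range(d)) for j in range(k)] for r in rows]


def gram_schmidt_select(vectors, n_select, tol=1e-10):
    """Assumed greedy rule: pick the vector with the largest residual norm,
    then remove its direction from every residual (one Gram-Schmidt step)."""
    residual = [list(v) for v in vectors]
    selected = []
    for _ in range(min(n_select, len(vectors))):
        norms = [sum(x * x for x in r) for r in residual]
        best = max(range(len(residual)), key=lambda i: norms[i])
        if norms[best] <= tol:
            break  # remaining features lie in the span of those already selected
        selected.append(best)
        q = [x / norms[best] ** 0.5 for x in residual[best]]
        for r in residual:
            dot = sum(a * b for a, b in zip(r, q))
            for j in range(len(r)):
                r[j] -= dot * q[j]
    return selected
```

A possible usage, under the same assumptions, is `gram_schmidt_select(random_project(cooccurrence(X), k), m)`, which returns the indices of up to `m` terms whose projected co-occurrence vectors are mutually non-redundant. Projecting to `k` dimensions first shortens every inner product in the selection loop, which is one plausible reading of how random projection speeds up the GS process.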

Keywords

Gram–Schmidt orthogonalization; Feature selection; Random projection