Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
431867 | 688642 | 2013 | 11-page PDF | Free download |

Dimensionality reduction is an established area in text mining and information retrieval. Its methods convert highly sparse corpus matrices into a dense matrix format while preserving or improving classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private-memory parallel computer systems. According to Zipf’s law, the majority of indexing terms occur in only a small number of documents. Our algorithm replaces each rare term with a vector that expresses its semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix as well as to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 on the Reuters corpus, with a clear improvement in retrieval performance. We have evaluated our parallel implementation using the Message Passing Interface with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction of over 800,000 documents in just under 100 s.
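
The idea of replacing rare terms with vectors over common terms can be illustrated with a small sketch. The abstract does not give the actual construction, so everything here is an assumption made for illustration: the document-frequency threshold `min_df`, the use of weighted co-occurrence to describe a rare term, the normalization, and the function names `build_projection` and `project`. Term weights are assumed to be positive. It is not the authors' algorithm.

```python
from collections import Counter, defaultdict


def build_projection(docs, min_df):
    """Build a term -> {common term: weight} projection (illustrative only).

    docs   : list of dicts mapping term -> positive weight
    min_df : hypothetical threshold; terms occurring in at least
             min_df documents are treated as "common"
    """
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(doc.keys())

    common = sorted(t for t, n in df.items() if n >= min_df)
    common_set = set(common)

    # Common terms are kept unchanged (identity part of the projection).
    projection = {t: {t: 1.0} for t in common}

    # Assumption: a rare term is described by its weighted co-occurrence
    # with the common terms of the documents it appears in.
    acc = defaultdict(lambda: defaultdict(float))
    for doc in docs:
        common_here = doc.keys() & common_set
        for term, weight in doc.items():
            if term in common_set:
                continue
            for c in common_here:
                acc[term][c] += weight * doc[c]

    # Normalize each rare-term vector so that its entries sum to 1.
    for term, vec in acc.items():
        total = sum(vec.values())
        projection[term] = {c: v / total for c, v in vec.items()}

    return common, projection


def project(vector, projection):
    """Map a document or query vector onto the common-term features."""
    reduced = defaultdict(float)
    for term, weight in vector.items():
        # Terms without a projection entry are dropped.
        for c, p in projection.get(term, {}).items():
            reduced[c] += weight * p
    return dict(reduced)


# Tiny example corpus: "feline" occurs in one document only, so it is rare.
docs = [
    {"cat": 1.0, "pet": 1.0, "feline": 1.0},
    {"cat": 1.0, "pet": 1.0},
    {"dog": 1.0, "pet": 1.0},
    {"dog": 1.0, "cat": 1.0},
]
common, projection = build_projection(docs, min_df=2)
query = project({"feline": 2.0}, projection)
```

In this toy corpus a query containing only the rare term `feline` is mapped onto the common terms `cat` and `pet`, so it can still match documents that never contain `feline` itself. The same `project` call applies to corpus documents and to queries, which mirrors the role the abstract assigns to the projection matrix.
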
► We present a novel algorithm for dimensionality reduction in text retrieval.
► We report a strong reduction in features and a clear improvement in search quality.
► We formally derive a task and data parallelization of the serial algorithm.
► We give a concise definition of the parallel algorithm and analyze the complexity.
► We evaluate the performance on a small cluster and obtain a speed-up on real data.
Journal: Journal of Parallel and Distributed Computing - Volume 73, Issue 3, March 2013, Pages 341–351