Parallel rare term vector replacement: Fast and effective dimensionality reduction for text

Article code	Journal code	Publication year	English article	Full-text version
431867	688642	2013	11-page PDF	Free download


Keywords

Parallel algorithms; Information retrieval; Dimensionality reduction

Related topics

Engineering and Basic Sciences; Computer Engineering; Computational Theory and Mathematics


Abstract

Dimensionality reduction is an established area in text mining and information retrieval. Its methods convert highly sparse corpus matrices into a dense matrix format while preserving or improving classification accuracy or retrieval performance. In this paper, we describe a novel approach to dimensionality reduction for text, along with a parallel algorithm suitable for private memory parallel computer systems. According to Zipf’s law, the majority of indexing terms occur in only a small number of documents. Our algorithm replaces each rare term with a vector that expresses its semantics in terms of common terms. This process produces a projection matrix, which can be applied to a corpus matrix and to individual document and query vectors. We give an accurate mathematical and algorithmic description of our algorithms and present an experimental evaluation on two benchmark corpora. These experiments indicate that our algorithm can deliver a substantial reduction in the number of features, from 47,236 to 392 features on the Reuters corpus, with a clear improvement in retrieval performance. We have evaluated our parallel implementation using the message passing interface with up to 32 processes on a Nehalem Xeon cluster, computing the projection matrix for the dimensionality reduction of over 800,000 documents in just under 100 s.
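The abstract describes the idea only in words, so the following is a minimal sketch of one plausible reading, not the authors' exact formulation. It assumes that a term is "rare" when its document frequency falls below a threshold (`min_df`, a hypothetical parameter), and that the replacement vector of a rare term is the weighted average of the common-term parts of the documents containing it. The function names and the dictionary-based sparse vectors are illustrative choices.

```python
from collections import defaultdict


def build_replacement_vectors(docs, min_df):
    """Split terms into common/rare and build one replacement vector per rare term.

    docs   -- list of sparse document vectors (dict: term -> positive weight)
    min_df -- assumed rarity threshold: terms in fewer documents are "rare"
    """
    # Document frequency of every term.
    df = defaultdict(int)
    for doc in docs:
        for term in doc:
            df[term] += 1
    common = sorted(t for t, n in df.items() if n >= min_df)
    common_set = set(common)

    # For each rare term, accumulate the weighted sum of the common-term
    # parts of the documents in which it occurs.
    sums = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for doc in docs:
        common_part = {t: w for t, w in doc.items() if t in common_set}
        for term, weight in doc.items():
            if term in common_set:
                continue
            totals[term] += weight
            for t, w in common_part.items():
                sums[term][t] += weight * w

    # Normalise to a weighted average (an assumption of this sketch).
    replacements = {
        term: {t: v / totals[term] for t, v in vec.items()}
        for term, vec in sums.items()
    }
    return common, replacements


def project(vec, common, replacements):
    """Map a document or query vector onto the common terms only."""
    common_set = set(common)
    out = defaultdict(float)
    for term, weight in vec.items():
        if term in common_set:
            out[term] += weight
        else:
            # Rare (or unseen) term: spread its weight over common terms.
            for t, w in replacements.get(term, {}).items():
                out[t] += weight * w
    return [out[t] for t in common]


docs = [
    {"cat": 1.0, "pet": 1.0, "siamese": 1.0},
    {"cat": 1.0, "pet": 1.0},
    {"dog": 1.0, "pet": 1.0},
    {"dog": 1.0, "cat": 1.0},
]
common, replacements = build_replacement_vectors(docs, min_df=2)
query = project({"siamese": 2.0}, common, replacements)
```

In this toy corpus the rare term "siamese" is expressed through "cat" and "pet", so a query containing only that term still yields a dense vector over the three common terms. Applying `project` to every document plays the role of multiplying the corpus matrix by the projection matrix.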

► We present a novel algorithm for dimensionality reduction in text retrieval.
► We report a strong reduction in features and a clear improvement in search quality.
► We formally derive a task and data parallelization of the serial algorithm.
► We give a concise definition of the parallel algorithm and analyze the complexity.
► We evaluate the performance on a small cluster and obtain a speed-up on real data.
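The data parallelization mentioned in the highlights can be illustrated with a small sketch. This is not the paper's MPI implementation: it assumes the documents are partitioned across processes, that each process computes partial sums for the rare terms in its own partition, and that a sum reduction combines them. Here the partitions are processed in a single Python process for simplicity, the set of common terms is assumed to be known already, and all function names are hypothetical.

```python
from collections import defaultdict


def partial_sums(docs, common_set):
    """Local step: one partition's contribution to every rare term."""
    sums = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for doc in docs:
        common_part = {t: w for t, w in doc.items() if t in common_set}
        for term, weight in doc.items():
            if term not in common_set:
                totals[term] += weight
                for t, w in common_part.items():
                    sums[term][t] += weight * w
    return sums, totals


def merge(parts):
    """Reduction step: element-wise sum of all partial results.

    In a message-passing setting this is the role a sum reduction
    across processes would play (assumption of this sketch).
    """
    sums = defaultdict(lambda: defaultdict(float))
    totals = defaultdict(float)
    for part_sums, part_totals in parts:
        for term, vec in part_sums.items():
            for t, v in vec.items():
                sums[term][t] += v
        for term, v in part_totals.items():
            totals[term] += v
    return sums, totals


def replacements_from_partitions(partitions, common_set):
    # Each call to partial_sums is independent and could run on its own process.
    parts = [partial_sums(p, common_set) for p in partitions]
    sums, totals = merge(parts)
    return {
        term: {t: v / totals[term] for t, v in vec.items()}
        for term, vec in sums.items()
    }


common_set = {"a", "b"}
partitions = [
    [{"a": 1.0, "x": 1.0}],  # documents held by process 0
    [{"b": 1.0, "x": 3.0}],  # documents held by process 1
]
result = replacements_from_partitions(partitions, common_set)
```

Because the partial sums are additive, the merged result is the same as processing all documents in one partition, which is what makes this step straightforward to distribute.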

Publisher

Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 73, Issue 3, March 2013, Pages 341–351

Authors

T. Berka, M. Vajteršic
