Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
515639 | 867057 | 2012 | 16-page PDF | Free download |

In Information Retrieval (IR), the efficient indexing of terabyte-scale and larger corpora remains a difficult problem. MapReduce has been proposed as a framework for distributing data-intensive operations across multiple processing machines. In this work, we provide a detailed analysis of four MapReduce indexing strategies of varying complexity. Moreover, we evaluate these indexing strategies by implementing them in an existing IR framework and performing experiments using the Hadoop MapReduce implementation, in combination with several large standard TREC test corpora. In particular, we examine the efficiency of the indexing strategies and, for the most efficient strategy, we examine how it scales with respect to corpus size and processing power. Our results attest to both the importance of minimising data transfer between machines for IO-intensive tasks such as indexing, and the suitability of the per-posting list MapReduce indexing strategy, in particular for indexing at terabyte scale. Hence, we conclude that MapReduce is a suitable framework for the deployment of large-scale indexing.
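As a rough illustration of how indexing maps onto MapReduce, the following is a minimal Hadoop sketch of the simplest strategy discussed in the paper, the per-token approach, which emits one intermediate pair per token occurrence. The class names, the assumed input format (one "docId<TAB>document text" record per value), and the whitespace tokenisation are assumptions made for this example only; they do not reflect the authors' Terrier-based implementation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Illustrative per-token strategy: one (term, docId) pair per token occurrence. */
public class PerTokenIndexing {

  public static class PerTokenMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // Assumed input: "docId<TAB>document text" per record (hypothetical format).
      String[] parts = record.toString().split("\t", 2);
      if (parts.length < 2) return;
      Text docId = new Text(parts[0]);
      StringTokenizer tok = new StringTokenizer(parts[1].toLowerCase());
      while (tok.hasMoreTokens()) {
        // One intermediate pair per token: simple, but the shuffle volume
        // grows with the total number of token occurrences in the corpus.
        context.write(new Text(tok.nextToken()), docId);
      }
    }
  }

  public static class PostingListReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      // Concatenate document ids into an uncompressed posting list for the term.
      StringBuilder postings = new StringBuilder();
      for (Text docId : docIds) {
        if (postings.length() > 0) postings.append(',');
        postings.append(docId.toString());
      }
      context.write(term, new Text(postings.toString()));
    }
  }
}
```

The weakness visible in this sketch is exactly what the experiments quantify: every token occurrence becomes intermediate data that must be sorted and transferred to the reducers.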
Research highlights
► We compare and contrast four MapReduce indexing strategies using terabyte-scale data.
► Per-token and per-term indexing strategies are impractical for indexing at terabyte scale.
► Per-document indexing, while more efficient than per-term/token strategies, is compromised by lengthy reduce phases.
► Per-posting list indexing is the most efficient strategy, due to decreased data traffic between map and reduce tasks; a simplified sketch of this idea follows these highlights.
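The sketch below shows one way the per-posting-list idea can be expressed as a Hadoop mapper: each map task accumulates partial posting lists in memory and emits one list per term in cleanup(), so intermediate data scales with the number of distinct terms seen by the task rather than with the token count. The class name, the "docId:tf" posting encoding, and the flush-only-at-cleanup policy are assumptions for illustration; the paper's implementation (built on an existing IR framework) handles memory limits and compression, which are omitted here.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Illustrative per-posting-list mapper: emit partial posting lists, not tokens. */
public class PerPostingListMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  // Partial index for this map task: term -> "docId:tf,docId:tf,..."
  private final Map<String, StringBuilder> partialPostings = new HashMap<>();

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Assumed input: "docId<TAB>document text" per record (hypothetical format).
    String[] parts = record.toString().split("\t", 2);
    if (parts.length < 2) return;
    String docId = parts[0];

    // Count term frequencies within the document first.
    Map<String, Integer> tf = new HashMap<>();
    StringTokenizer tok = new StringTokenizer(parts[1].toLowerCase());
    while (tok.hasMoreTokens()) {
      tf.merge(tok.nextToken(), 1, Integer::sum);
    }

    // Append one posting per (term, document) to the in-memory partial lists,
    // instead of emitting one pair per token occurrence.
    for (Map.Entry<String, Integer> e : tf.entrySet()) {
      StringBuilder list =
          partialPostings.computeIfAbsent(e.getKey(), k -> new StringBuilder());
      if (list.length() > 0) list.append(',');
      list.append(docId).append(':').append(e.getValue());
    }
    // A full implementation would also flush to the reducers when memory
    // runs low; that bookkeeping is omitted in this sketch.
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit one partial posting list per term seen by this map task.
    for (Map.Entry<String, StringBuilder> e : partialPostings.entrySet()) {
      context.write(new Text(e.getKey()), new Text(e.getValue().toString()));
    }
  }
}
```

A matching reducer only has to merge a handful of partial lists per term (at most one per map task), which is where the reduced map-to-reduce data traffic highlighted above comes from.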
Journal: Information Processing & Management - Volume 48, Issue 5, September 2012, Pages 873–888