کد مقاله کد نشریه سال انتشار مقاله انگلیسی نسخه تمام متن
433011 689201 2015 12 صفحه PDF دانلود رایگان
عنوان انگلیسی مقاله ISI
GOM-Hadoop: A distributed framework for efficient analytics on ordered datasets
موضوعات مرتبط
مهندسی و علوم پایه مهندسی کامپیوتر نظریه محاسباتی و ریاضیات
پیش نمایش صفحه اول مقاله
GOM-Hadoop: A distributed framework for efficient analytics on ordered datasets
چکیده انگلیسی


• We generalize a class of big data analytics workload (Re-Org) on ordered datasets.
• We propose a novel distributed mechanism for efficiently executing Re-Org tasks.
• The proposed mechanism is implemented in a distributed framework by extending Hadoop.
• A model is presented to formally study the proposed framework.
• Experiments show that our framework is 6.3x faster than vanilla Hadoop.

One of the most common datasets exploited by many corporations to conduct business intelligence analysis is event log files. Oftentimes, the records in event log files are temporally ordered, and need to be grouped by certain key with the temporal ordering preserved to facilitate further analysis. One such example is to group temporally ordered events by user ID in order to analyze user behavior. This kind of analytical workload, here referred to as RElative Order-pReserving based Grouping (Re-Org), is quite common in big data analytics, where the MapReduce programming paradigm (and its open-source implementation, Hadoop) is widely adopted for massive parallel processing. However, using MapReduce/Hadoop for executing Re-Org tasks on ordered datasets is not efficient due to its internal sort–merge mechanism when shuffling data from mappers to reducers. In this paper, we propose a distributed framework that adopts an efficient group-order–merge mechanism to speed up the execution of Re-Org tasks. We demonstrate the advantage of our framework by formally modeling its execution process and by comparing its performance with Hadoop through extensive experiments on real-world datasets. The evaluation results show that our framework can achieve up to 6.3x speedup over Hadoop in executing Re-Org tasks.

ناشر
Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Journal of Parallel and Distributed Computing - Volume 83, September 2015, Pages 58–69
نویسندگان
, , , , ,