We describe a parallel library written with message-passing (MPI) calls that allows algorithms to be expressed in the MapReduce paradigm. This means the calling program does not need to include explicit parallel code, but instead provides “map” and “reduce” functions that operate independently on elements of a data set distributed across processors. The library performs the needed data movement between processors. We describe how typical MapReduce functionality can be implemented in an MPI context, and also in an out-of-core manner for data sets that do not fit within the aggregate memory of a parallel machine. Our motivation for creating this library was to enable graph algorithms to be written as MapReduce operations, allowing the processing of terabyte-scale data sets on traditional MPI-based clusters. We outline MapReduce versions of several such algorithms: vertex ranking via PageRank, triangle finding, connected-component identification, Luby’s algorithm for maximal independent sets, and single-source shortest-path calculation. To test the algorithms on arbitrarily large artificial graphs, we generate randomized R-MAT matrices in parallel; a MapReduce version of this operation is also described. Performance and scalability results for the various algorithms are presented for graphs of varying size on a distributed-memory cluster. In some cases, we compare the results with those of non-MapReduce algorithms, different machines, and different MapReduce software, namely Hadoop. Our open-source library is written in C++, is callable from C++, C, Fortran, or scripting languages such as Python, and can run on any parallel platform that supports MPI.
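To make the calling pattern concrete, the following is a minimal word-count-style sketch of how a program might drive such a library: the user supplies the map and reduce callbacks, and the library handles distribution and key aggregation. The class and callback names (MapReduce, KeyValue, map/collate/reduce) are modeled on MR-MPI's published examples, but the signatures shown here are illustrative and should be checked against the library's own documentation.

```cpp
// Illustrative sketch of the MR-MPI calling pattern; names and
// signatures follow the library's documented examples but are
// not guaranteed to match the actual headers.
#include <cstring>
#include <mpi.h>
#include "mapreduce.h"
#include "keyvalue.h"
using namespace MAPREDUCE_NS;

// Map callback: each map task emits (key, value) pairs into kv.
void emit_words(int itask, KeyValue *kv, void *ptr) {
  const char *words[] = {"graph", "edge", "graph"};
  for (int i = 0; i < 3; i++)
    kv->add((char *) words[i], (int) strlen(words[i]) + 1, NULL, 0);
}

// Reduce callback: invoked once per unique key after collate(),
// with all of that key's values concatenated in multivalue.
void count(char *key, int keybytes, char *multivalue,
           int nvalues, int *valuebytes, KeyValue *kv, void *ptr) {
  kv->add(key, keybytes, (char *) &nvalues, sizeof(int));
}

int main(int narg, char **args) {
  MPI_Init(&narg, &args);
  MapReduce *mr = new MapReduce(MPI_COMM_WORLD);
  mr->map(100, emit_words, NULL);  // 100 map tasks spread across ranks
  mr->collate(NULL);               // aggregate duplicate keys across procs
  mr->reduce(count, NULL);         // one reduce call per unique key
  delete mr;
  MPI_Finalize();
}
```

Note that the only explicitly parallel calls in this sketch are MPI_Init and MPI_Finalize; all interprocessor data movement happens inside collate(), which is the point of the library's design.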
Research highlights
► We show how MapReduce operations can be performed on top of the message-passing interface (MPI) in parallel and out-of-core, and describe our open-source implementation, called the MR-MPI library.
► We describe MapReduce algorithms for several graph computations: generation of random R-MAT graphs (or matrices), vertex ranking via PageRank, triangle and connected-component finding, maximal independent set identification, and single-source shortest-path calculation (a sketch of the R-MAT generation kernel follows these highlights).
► We give performance data for running these algorithms on graphs with up to billions of edges.
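As a concrete illustration of the generation step highlighted above, the standard R-MAT procedure places each edge by recursively choosing one quadrant of the adjacency matrix with probabilities (a, b, c, d) at every level, which makes it embarrassingly parallel: each map task can generate its share of edges from its own random stream. A self-contained sketch of that kernel follows; the parameter values and function names are illustrative and not taken from the paper.

```cpp
// Illustrative R-MAT edge generator (not the paper's code).
// Each call picks one (row, col) edge in a 2^levels x 2^levels
// adjacency matrix by recursively selecting a quadrant with
// probabilities a, b, c, d (d = 1 - a - b - c).
#include <cstdint>
#include <cstdio>
#include <random>
#include <utility>

std::pair<uint64_t, uint64_t> rmat_edge(int levels, std::mt19937_64 &rng) {
  const double a = 0.57, b = 0.19, c = 0.19;  // typical R-MAT parameters
  std::uniform_real_distribution<double> uni(0.0, 1.0);
  uint64_t row = 0, col = 0;
  for (int bit = 0; bit < levels; bit++) {
    double r = uni(rng);
    row <<= 1; col <<= 1;
    if (r < a) { /* upper-left quadrant: both bits stay 0 */ }
    else if (r < a + b) col |= 1;       // upper-right quadrant
    else if (r < a + b + c) row |= 1;   // lower-left quadrant
    else { row |= 1; col |= 1; }        // lower-right quadrant
  }
  return {row, col};
}

int main() {
  std::mt19937_64 rng(12345);  // in a parallel run, seed per map task
  for (int i = 0; i < 5; i++) {
    auto [u, v] = rmat_edge(20, rng);  // graph with 2^20 vertices
    std::printf("%llu %llu\n", (unsigned long long) u,
                (unsigned long long) v);
  }
}
```

In a MapReduce formulation, each map task would run this kernel for its assigned batch of edges and emit (edge, NULL) pairs; a collate and reduce can then remove any duplicate edges.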
Journal: Parallel Computing - Volume 37, Issue 9, September 2011, Pages 610–632