Article code: 524648
Journal code: 868800
Publication year: 2013
English article: 11 pages, PDF
Full-text version: free download
English title of the ISI article
Analyzing the performance of SMP memory allocators with iterative MapReduce applications
Related topics
Engineering and Basic Sciences, Computer Engineering, Computer Science Software
English abstract


• Analysis of memory allocators on SMPs with up to 512 cores.
• We measured NUMA traffic to quantify how well allocators preserve memory locality.
• Sixfold speedup with a basic custom allocator on top of the stock one.
• Optimized MapReduce framework for large shared-memory machines.
• We verified the SMP results with an MPI/OpenMP implementation on the same hardware.

The standard memory allocators of shared-memory systems (SMPs) often perform poorly because they do not sufficiently reflect the access latencies of deep NUMA architectures with their on-chip, off-chip, and off-blade communication. We analyze memory allocation strategies for data-intensive MapReduce applications on SMPs with up to 512 cores and 2 TB of memory. We compare the efficiency of the MapReduce frameworks MR-Search and Phoenix++ and provide performance results for two benchmark applications, k-means and shortest-path search.

Even on small SMPs with 128 cores, a 6-fold speedup can be achieved by replacing the standard glibc allocator with allocators that use pooling strategies. These savings become more pronounced on larger SMPs. We identify two types of overhead: (1) the cost of executing the malloc/free operations and (2) the poor memory locality caused by an ineffective mapping to the underlying memory hierarchy. We give detailed results on the NUMA traffic and show how the cost increases on large SMPs with many cores and a deep NUMA hierarchy.

For verification, we run hybrid MPI/OpenMP implementations of the same benchmarks on systems with explicit message passing. The results reveal that the bottleneck is neither the hardware nor the Linux kernel but solely the poor locality of the allocated memory pages.
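
The abstract does not describe how the pooling allocators are built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes fixed-size blocks and a hypothetical class name `BlockPool`: memory is requested from the stock `malloc` in large chunks, and individual blocks are then handed out and recycled through a free list, so most allocate/free operations never reach the stock allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool layered on top of the stock malloc/free.
// Illustrative only; the class and its parameters are assumptions.
class BlockPool {
public:
    BlockPool(std::size_t block_size, std::size_t blocks_per_chunk)
        : block_size_(round_up(block_size < sizeof(Node) ? sizeof(Node) : block_size)),
          blocks_per_chunk_(blocks_per_chunk == 0 ? 1 : blocks_per_chunk) {}

    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // All chunks go back to the stock allocator at once.
    ~BlockPool() {
        for (void* chunk : chunks_) std::free(chunk);
    }

    // Pop one block from the free list, refilling from malloc when empty.
    void* allocate() {
        if (free_list_ == nullptr) refill();
        Node* node = free_list_;
        free_list_ = node->next;
        return node;
    }

    // Push a block back onto the free list; the stock free() is not called.
    void deallocate(void* p) {
        assert(p != nullptr);
        free_list_ = new (p) Node{free_list_};
    }

    std::size_t chunk_count() const { return chunks_.size(); }

private:
    struct Node { Node* next; };

    // Round the block size up so every block in a chunk stays aligned.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }

    // One call to the stock allocator serves blocks_per_chunk_ requests.
    void refill() {
        chunks_.reserve(chunks_.size() + 1);  // may throw before malloc, so no leak
        char* chunk = static_cast<char*>(std::malloc(block_size_ * blocks_per_chunk_));
        if (chunk == nullptr) throw std::bad_alloc();
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < blocks_per_chunk_; ++i) {
            deallocate(chunk + i * block_size_);
        }
    }

    std::size_t block_size_;
    std::size_t blocks_per_chunk_;
    Node* free_list_ = nullptr;
    std::vector<void*> chunks_;
};
```

One plausible way to use such a pool on a NUMA machine is to give each worker thread its own instance (for example `thread_local BlockPool pool(64, 1024);`), so that a thread allocates without locking and reuses pages it has already touched. This sketch is not thread-safe on its own, and a block must be returned to the pool it came from.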

Publisher
Database: Elsevier - ScienceDirect
Journal: Parallel Computing - Volume 39, Issue 12, December 2013, Pages 879–889