Article code | Journal code | Publication year | Article language | Full-text version |
---|---|---|---|---|
524648 | 868800 | 2013 | English, 11-page PDF | Free download |

• Analysis of memory allocators on SMPs with up to 512 cores.
• We measured NUMA traffic to quantify how well allocators preserve memory locality.
• Sixfold speedup with a basic custom allocator on top of the stock one.
• Optimized MapReduce framework for large shared-memory machines.
• We verified the SMP results with an MPI/OpenMP implementation on the same hardware.
The standard memory allocators of shared-memory systems (SMPs) often provide poor performance because they do not sufficiently reflect the access latencies of deep NUMA architectures with their on-chip, off-chip, and off-blade communication. We analyze memory allocation strategies for data-intensive MapReduce applications on SMPs with up to 512 cores and 2 TB of memory. We compare the efficiency of the MapReduce frameworks MR-Search and Phoenix++ and provide performance results for two benchmark applications, k-means and shortest-path search. Even on small SMPs with 128 cores, a sixfold speedup can be achieved by replacing the standard glibc allocator with allocators that use pooling strategies, and these savings become more pronounced on larger SMPs. We identify two types of overhead: (1) the cost of executing the malloc/free operations and (2) the poor memory locality caused by an ineffective mapping to the underlying memory hierarchy. We give detailed results on the NUMA traffic and show how the cost increases on large SMPs with many cores and a deep NUMA hierarchy. For verification, we run hybrid MPI/OpenMP implementations of the same benchmarks on systems with explicit message passing. The results reveal that neither the hardware nor the Linux kernel is the bottleneck; rather, it is the poor locality of the allocated memory pages.
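To illustrate the first type of overhead, the sketch below shows a minimal per-thread pooling allocator layered on top of the stock malloc. It is not the paper's allocator; all names (`pool_t`, `pool_alloc`, `pool_release`, `POOL_CHUNK`) are illustrative assumptions. The idea is that worker threads bump-allocate small records from large chunks and release a whole phase's allocations at once, so most malloc/free calls and their lock contention disappear.

```c
/*
 * Minimal sketch of a per-thread pooling allocator on top of the stock
 * malloc.  Names and sizes are illustrative, not taken from the paper.
 * Assumes individual requests are smaller than POOL_CHUNK.
 */
#include <assert.h>
#include <stdlib.h>

#define POOL_CHUNK (1 << 20)          /* grab memory from malloc in 1 MiB chunks */

typedef struct pool_block {
    struct pool_block *next;          /* previously filled chunks */
    size_t used;                      /* bytes handed out from this chunk */
    unsigned char data[];             /* chunk payload */
} pool_block;

typedef struct {
    pool_block *head;                 /* current (partially filled) chunk */
} pool_t;

/* Bump-allocate from the current chunk; fetch a new chunk when it is full. */
static void *pool_alloc(pool_t *p, size_t n)
{
    n = (n + 15) & ~(size_t)15;       /* keep 16-byte alignment */
    assert(n <= POOL_CHUNK);
    if (!p->head || p->head->used + n > POOL_CHUNK) {
        pool_block *b = malloc(sizeof *b + POOL_CHUNK);
        if (!b) return NULL;
        b->next = p->head;
        b->used = 0;
        p->head = b;
    }
    void *ptr = p->head->data + p->head->used;
    p->head->used += n;
    return ptr;
}

/* Release everything at once, e.g. after a map or reduce phase. */
static void pool_release(pool_t *p)
{
    while (p->head) {
        pool_block *next = p->head->next;
        free(p->head);
        p->head = next;
    }
}

/* One pool per worker thread avoids contention on the global allocator. */
static __thread pool_t tls_pool;
```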
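The second type of overhead, poor locality on a deep NUMA hierarchy, can be attacked by placing each worker's data on its own NUMA node. The following sketch uses the standard libnuma interface (compile with `-lnuma`); the partition size and usage pattern are assumptions for illustration only, not the configuration used in the paper.

```c
/*
 * Sketch of NUMA-aware placement with libnuma (link with -lnuma).
 * Each worker allocates its partition of the input on its own NUMA node,
 * so map tasks read node-local pages instead of generating off-chip or
 * off-blade traffic.
 */
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return EXIT_FAILURE;
    }

    size_t bytes = 64UL << 20;                    /* 64 MiB partition per worker (illustrative) */
    int node = numa_node_of_cpu(sched_getcpu());  /* NUMA node of the calling thread */

    /* Pages are physically placed on this thread's own NUMA node. */
    double *partition = numa_alloc_onnode(bytes, node);
    if (!partition) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return EXIT_FAILURE;
    }

    /* ... fill and process the partition with node-local accesses ... */

    numa_free(partition, bytes);
    return EXIT_SUCCESS;
}
```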
Journal: Parallel Computing - Volume 39, Issue 12, December 2013, Pages 879–889