Article code: 432371
Journal code: 688869
Publication year: 2013
English article: 14-page PDF
Full-text version: Free download
English title of the ISI article
Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver
Related topics
Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics
English abstract


• Evaluated the design space of heterogeneous execution with a complex application.
• Designed a new memory management framework across host and accelerator memories.
• Naive hybrid execution of CPU plus GPUs slows down rather than speeds up execution.
• Optimized hybrid execution made faster than homogeneous CPU-only/GPU-only execution.
• Excellent scaling achieved on up to 940 GPUs plus 15,040 cores in a single execution.

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
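
To make the phrase "block-level linear algebra operations" concrete, the following is a minimal serial sketch of a block tridiagonal solve. The abstract does not state which factorization the paper's distributed solver uses, so this sketch substitutes the textbook block Thomas algorithm as a stand-in. All function names (`matmul`, `matsub`, `solve`, `block_thomas`) are hypothetical, the blocks are assumed to be small, dense, and nonsingular, and nothing here touches a GPU. The block products and block solves are the kind of operations that an implementation could hand to libraries such as MAGMA, CUBLAS, or ACML.

```python
# Hypothetical sketch: serial block Thomas algorithm (a stand-in, not the
# paper's distributed solver). Blocks are dense matrices as lists of lists.

def matmul(a, b):
    """Dense product of a (m x k) and b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matsub(a, b):
    """Elementwise difference a - b of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    a is m x m and assumed nonsingular; b is m x n. Inputs are not modified.
    """
    m, n = len(a), len(b[0])
    aug = [list(ra) + list(rb) for ra, rb in zip(a, b)]  # augmented [a | b]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + n):
                aug[r][c] -= f * aug[col][c]
    x = [[0.0] * n for _ in range(m)]
    for r in range(m - 1, -1, -1):  # back substitution
        for j in range(n):
            s = aug[r][m + j] - sum(aug[r][c] * x[c][j] for c in range(r + 1, m))
            x[r][j] = s / aug[r][r]
    return x

def block_thomas(lower, diag, upper, rhs):
    """Solve lower[i] x[i-1] + diag[i] x[i] + upper[i] x[i+1] = rhs[i].

    lower[0] and upper[-1] are unused and may be None.
    """
    n = len(diag)
    d = [None] * n  # modified diagonal blocks
    y = [None] * n  # modified right-hand sides
    d[0], y[0] = diag[0], rhs[0]
    for i in range(1, n):  # forward elimination, one block row at a time
        w_u = solve(d[i - 1], upper[i - 1])
        w_y = solve(d[i - 1], y[i - 1])
        d[i] = matsub(diag[i], matmul(lower[i], w_u))
        y[i] = matsub(rhs[i], matmul(lower[i], w_y))
    x = [None] * n
    x[n - 1] = solve(d[n - 1], y[n - 1])
    for i in range(n - 2, -1, -1):  # block back substitution
        x[i] = solve(d[i], matsub(y[i], matmul(upper[i], x[i + 1])))
    return x
```

Calling `block_thomas(lower, diag, upper, rhs)` with lists of equally sized blocks returns the list of solution blocks `x[0]` through `x[n-1]`. The forward sweep is inherently sequential across block rows, so this version illustrates only the per-block operations and none of the parallelism across nodes that the abstract describes.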

Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 73, Issue 12, December 2013, Pages 1578–1591
Authors
, ,