Efficient heterogeneous execution on large multicore and accelerator platforms: Case study using a block tridiagonal solver

Article code	Journal code	Publication year	English article	Full text
432371	688869	2013	14 pages (PDF)	Free download

Keywords

Linear algebra; Accelerator; Memory management; GPU (graphics processing unit)

Related topics

Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics

Abstract

• Evaluated the design space of heterogeneous execution with a complex application.
• Designed a new memory management framework across host and accelerator memories.
• Naive hybrid execution of CPU plus GPUs slows down rather than speeds up execution.
• Optimized hybrid execution made faster than homogeneous CPU-only/GPU-only execution.
• Excellent scaling achieved on up to 940 GPUs plus 15,040 cores in a single execution.

The algorithmic and implementation principles are explored in gainfully exploiting GPU accelerators in conjunction with multicore processors on high-end systems with large numbers of compute nodes, and evaluated in an implementation of a scalable block tridiagonal solver. The accelerator of each compute node is exploited in combination with multicore processors of that node in performing block-level linear algebra operations in the overall, distributed solver algorithm. Optimizations incorporated include: (1) an efficient memory mapping and synchronization interface to minimize data movement, (2) multi-process sharing of the accelerator within a node to obtain balanced load with multicore processors, and (3) an automatic memory management system to efficiently utilize accelerator memory when sub-matrices spill over the limits of device memory. Results are reported from our novel implementation that uses MAGMA and CUBLAS accelerator software systems simultaneously with ACML (2013) [2] for multithreaded execution on processors. Overall, using 940 nVidia Tesla X2090 accelerators and 15,040 cores, the best heterogeneous execution delivers a 10.9-fold reduction in run time relative to an already efficient parallel multicore-only baseline implementation that is highly optimized with intra-node and inter-node concurrency and computation–communication overlap. Detailed quantitative results are presented to explain all critical runtime components contributing to hybrid performance.
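Optimization (3) above concerns sub-matrices that no longer fit in device memory. The abstract does not say how the authors' memory manager chooses which blocks stay resident, so the following is only a minimal sketch of one common approach: a fixed-capacity pool with least-recently-used eviction. The class name `DeviceBlockCache`, its methods, and the eviction policy are all hypothetical and are not taken from the paper; the sketch only counts transfers and evictions and does no real GPU allocation.

```python
from collections import OrderedDict


class DeviceBlockCache:
    """Toy model of a fixed-capacity accelerator memory holding matrix blocks.

    Hypothetical illustration only: LRU eviction is an assumption, not the
    policy described in the paper.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.resident = OrderedDict()  # block_id -> size in bytes, oldest first
        self.transfers = 0             # host-to-device copies performed
        self.evictions = 0             # blocks evicted to make room

    def acquire(self, block_id, size_bytes):
        """Make a block resident, evicting least recently used blocks if needed."""
        if size_bytes > self.capacity_bytes:
            raise ValueError("block larger than device memory")
        if block_id in self.resident:
            self.resident.move_to_end(block_id)  # mark as most recently used
            return "hit"
        while self.used_bytes + size_bytes > self.capacity_bytes:
            _, freed = self.resident.popitem(last=False)  # drop the oldest block
            self.used_bytes -= freed
            self.evictions += 1
        self.resident[block_id] = size_bytes
        self.used_bytes += size_bytes
        self.transfers += 1
        return "miss"


# Example: room for two 4-byte blocks; block names are arbitrary labels.
cache = DeviceBlockCache(capacity_bytes=8)
trace = [cache.acquire(name, 4) for name in ["A", "B", "A", "C", "B"]]
```

In this toy trace, reusing block A while it is still resident avoids a transfer, whereas requesting C forces an eviction. A real manager would also have to copy modified blocks back to the host before evicting them, which the sketch omits.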

Publisher

Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volume 73, Issue 12, December 2013, Pages 1578–1591

Authors

Alfred J. Park, Kalyan S. Perumalla
