From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming

Article ID	Journal ID	Year	English article
524340	868615	2012	17-page PDF

Keywords

Auto-tuning; Hardware accelerators; Portability

Abstract

In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos Group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initialization steps that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative Level 3 BLAS routines to implement in OpenCL. We profile TRSM to obtain the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the NVIDIA Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by the two companies at the time. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies between the optimizations performed by the OpenCL and CUDA compilers, and other issues that affect performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose the use of auto-tuning, driven by a search harness, to better explore these kernels' parameter space.

► GPU accelerators from NVIDIA and ATI significantly increase application performance.
► Achieving performance across OpenCL and CUDA programming frameworks is not trivial.
► Low-level languages achieve 80% of peak performance on multicores and accelerators.
► High-level languages achieve 50% of peak performance on multicores and accelerators.

Publisher

Database: Elsevier - ScienceDirect
Journal: Parallel Computing - Volume 38, Issue 8, August 2012, Pages 391–407

Authors

Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, Jack Dongarra
