Efficient 3D stencil computations using CUDA

کد مقاله	کد نشریه	سال انتشار	مقاله انگلیسی	نسخه تمام متن
523948	868534	2013	16 صفحه PDF	دانلود رایگان

عنوان انگلیسی مقاله ISI

دانلود مقاله + سفارش ترجمه

دانلود مقاله ISI انگلیسی

رایگان برای ایرانیان

کلمات کلیدی

Multigrid Stencil computation CUDA - کودا. پردازش موازی و مدل برنامه‌نویسی

موضوعات مرتبط

مهندسی و علوم پایه مهندسی کامپیوتر نرم افزارهای علوم کامپیوتر

پیش نمایش صفحه اول مقاله

Efficient 3D stencil computations using CUDA

چکیده انگلیسی

• 12 Gpts/s throughput of 7-point stencil on Tesla C2050.
• Effectively only 10% memory overhead related to the data block halos.
• 95% of benchmarked memory bandwidth.
• 450 GFLOP/s for compute bound general 27-point stencil.

We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions, and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively.

ناشر

Database: Elsevier - ScienceDirect (ساینس دایرکت)
Journal: Parallel Computing - Volume 39, Issue 10, October 2013, Pages 533–548

نویسندگان

Marcin Krotkiewski, Marcin Dabrowski,

علوم انسانی و هنر

فنی، مهندسی و علوم پایه

پزشکی و سلامت

بیو تکنولوژی

پذیرش سفارش ترجمه

Efficient 3D stencil computations using CUDA

دسترسی سریع

ارتباط

English Website