Article code: 432654
Journal code: 689006
Publication year: 2016
English article: 10 pages (PDF)
Full-text version: free download
English title of the ISI article
Transforming the multifluid PPM algorithm to run on GPUs
Related topics
Engineering and Basic Sciences; Computer Engineering; Computational Theory and Mathematics
English abstract


• An optimization for the limited on-chip workspace of GPUs.
• A user-controllable trade-off between workspace size and redundant computation.
• Automatic translators that apply the optimizations.
• Speedups of 1.7 to 2.4 times over the CPU systems.
• Performance superior or comparable to that of other CFD codes running on GPUs.

In the past several years, there has been much success in adapting numerical algorithms involving linear algebra and pairwise N-body force calculations to run well on GPUs. These numerical algorithms share the feature that high computational intensity can be achieved while holding only small amounts of data in on-chip storage. In previous work, we combined a briquette data structure and a heavily pipelined CFD processing of these data briquettes in sequence that results in a very small on-chip data workspace and high performance for our multifluid PPM gas dynamics algorithm on CPUs with standard sized caches. The on-chip data workspace produced in that earlier work is not small enough to meet the requirements of today’s GPUs, which demand that no more than 32 kB of on-chip data be associated with a single thread of control (a warp). Here we report a variant of our earlier technique that allows a user-controllable trade-off between workspace size and redundant computation that can be a win on GPUs. We use our multifluid PPM gas dynamics algorithm to illustrate this technique. Performance results for this algorithm in 32-bit precision on a recently introduced dual-chip GPU, the Nvidia K80, are 1.7 times that on a similarly recent dual CPU node using two 16-core Intel Haswell chips. The redundant computation that allows the on-chip data context for each thread of control to be less than 32 kB is roughly 9% of the total. We have built an automatic translator from a Fortran expression to CUDA to ease the programming burden that is involved in applying our technique.
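
The trade-off between workspace size and redundant computation can be illustrated with a toy model. The sketch below is not the authors' briquette scheme or their translator; it assumes a generic 1D overlapped-tiling setup in which each tile runs a chain of fused stencil stages and recomputes its halo cells rather than exchanging them with neighbours. The function name `overlapped_tile_cost` and all of its parameters are hypothetical, chosen only for illustration, and the numbers it produces are not meant to reproduce the 9% or 32 kB figures quoted in the abstract.

```python
def overlapped_tile_cost(n, stages, radius, fields=1, bytes_per_value=4):
    """Toy cost model for one 1D tile (an illustrative assumption).

    n               -- number of output cells the tile produces
    stages          -- number of fused stencil stages
    radius          -- stencil radius of every stage, in cells
    fields          -- number of state variables held per cell
    bytes_per_value -- 4 for 32-bit precision

    Returns (workspace_bytes, redundant_fraction).
    """
    # Stage k (0 = first) must cover the tile plus a halo that shrinks
    # by `radius` cells per side with every later stage.
    widths = [n + 2 * radius * (stages - 1 - k) for k in range(stages)]
    total_work = sum(widths)
    useful_work = n * stages
    # The input window must also feed the first stage's own stencil.
    workspace_bytes = fields * (n + 2 * radius * stages) * bytes_per_value
    redundant_fraction = (total_work - useful_work) / total_work
    return workspace_bytes, redundant_fraction


if __name__ == "__main__":
    # Smaller tiles need less on-chip storage but redo more halo work.
    for n in (8, 16, 32, 64):
        ws, red = overlapped_tile_cost(n, stages=3, radius=1, fields=5)
        print(f"tile={n:3d}  workspace={ws:5d} B  redundant={red:.1%}")
```

In this model, shrinking the tile lowers the workspace and raises the share of recomputed cells, which is the kind of knob the abstract describes as user-controllable.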

Publisher
Database: Elsevier - ScienceDirect
Journal: Journal of Parallel and Distributed Computing - Volumes 93–94, July 2016, Pages 56–65