Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

Article code	Journal code	Publication year	English article	Full-text version
523829	868503	2016	13-page PDF	Free download

Keywords

Compiler optimization

Related topics

Engineering and Basic Sciences, Computer Engineering, Computer Science Software

Abstract

• We improve performance of fine-grain UPC applications by orders of magnitude.
• We introduce a novel shared-data localization transformation.
• We present a thorough performance analysis and evaluation.
• We show that reducing run-time calls is crucial for performance.
• We achieve performance comparable to C and MPI using the UPC programming model.

Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance.

This paper addresses this issue and introduces various techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks, and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.

A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase performance of UPC programs up to 1.8× their UPC hand-optimized counterpart for applications with regular accesses and up to 6.3× for applications with irregular accesses.
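The inspector-executor idea mentioned in the abstract can be illustrated with a minimal sketch in plain C. This is not the paper's compiler transformation or the UPC runtime: the shared address space is simulated by an ordinary array, and `shared_data`, `runtime_calls`, `remote_get`, `remote_gather`, `sum_naive`, and `sum_inspector_executor` are all hypothetical names introduced only for illustration. The sketch contrasts one simulated runtime call per element with a single coalesced call driven by an index vector.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: a plain array stands in for the shared (possibly remote)
   address space, and a counter tracks simulated runtime calls. */
#define N_SHARED 16
double shared_data[N_SHARED];
long runtime_calls = 0;

/* Simulated fine-grained shared read: one runtime call per element. */
double remote_get(size_t i) {
    assert(i < N_SHARED);
    runtime_calls++;
    return shared_data[i];
}

/* Simulated coalesced read: one runtime call fetches n elements,
   selected by an index vector, into a local buffer. */
void remote_gather(const size_t *index_vector, size_t n, double *local_buf) {
    runtime_calls++;
    for (size_t k = 0; k < n; k++) {
        assert(index_vector[k] < N_SHARED);
        local_buf[k] = shared_data[index_vector[k]];
    }
}

/* Baseline loop: every iteration performs its own shared access. */
double sum_naive(const size_t *idx, size_t n) {
    double sum = 0.0;
    for (size_t k = 0; k < n; k++) {
        sum += remote_get(idx[k]);
    }
    return sum;
}

/* Inspector-executor version of the same loop. The caller supplies
   scratch space for the index vector and the local buffer (n entries each). */
double sum_inspector_executor(const size_t *idx, size_t n,
                              size_t *index_vector, double *local_buf) {
    /* Inspector: evaluate only the index expressions and record them. */
    for (size_t k = 0; k < n; k++) {
        index_vector[k] = idx[k];
    }
    /* Coalesce: a single bulk transfer replaces n fine-grained reads. */
    remote_gather(index_vector, n, local_buf);
    /* Executor: run the original computation on the local copy. */
    double sum = 0.0;
    for (size_t k = 0; k < n; k++) {
        sum += local_buf[k];
    }
    return sum;
}
```

Under these assumptions, both functions return the same sum for the same index list, but the baseline increments `runtime_calls` once per element while the inspector-executor version increments it once per loop. The extra inspector loop is the kind of instrumentation overhead that the abstract says the paper's optimizations aim to reduce.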

Publisher

Database: Elsevier - ScienceDirect
Journal: Parallel Computing - Volume 54, May 2016, Pages 2–14

Authors

Michail Alvanos, Ettore Tiotto, José Nelson Amaral, Montse Farreras, Xavier Martorell