Article code | Journal code | Publication year | English article | Full-text version |
---|---|---|---|---|
523844 | 868506 | 2013 | 14-page PDF | Free download |

Partitioned global address space (PGAS) languages combine the convenient abstraction of shared memory with the notion of affinity, extending multi-threaded programming to large-scale systems with physically distributed memory. However, in spite of their obvious advantages, PGAS languages still lack appropriate tool support for performance analysis, one of the reasons why their adoption is still in its infancy. Some of the performance problems for which tool support is needed occur at the level of the underlying one-sided communication substrate, such as the Aggregate Remote Memory Copy Interface (ARMCI). One such example is the waiting time in situations where asynchronous data transfers cannot be completed without software intervention at the target side. This is not uncommon on systems with reduced operating-system kernels such as IBM Blue Gene/P where the use of progress threads would double the number of cores necessary to run an application. In this paper, we present an extension of the Scalasca trace-analysis infrastructure aimed at the identification and quantification of progress-related waiting times at larger scales. We demonstrate its utility and scalability using a benchmark running with up to 32,768 processes.
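The progress-related waiting time described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Scalasca analysis or any ARMCI call: it assumes a one-sided request is served only at the moments when the target process happens to enter the communication library, and the names `progress_wait`, `progress_times`, and `transfer_cost` are hypothetical.

```python
from bisect import bisect_left


def progress_wait(request_time, progress_times, transfer_cost=0.0):
    """Toy model of progress-related waiting time for one request.

    request_time   -- time at which the origin issues the one-sided transfer
    progress_times -- sorted times at which the target enters the
                      communication library and can serve pending requests
                      (assumption: no progress thread, no hardware progress)
    transfer_cost  -- time the transfer itself takes once it is served

    Returns (completion_time, waiting_time).
    """
    # Find the first moment at or after the request when the target
    # makes progress; until then the request sits idle.
    i = bisect_left(progress_times, request_time)
    if i == len(progress_times):
        raise ValueError("target never makes progress after the request")
    served_at = progress_times[i]
    waiting_time = served_at - request_time
    return served_at + transfer_cost, waiting_time


if __name__ == "__main__":
    # The target enters the library at t=0, 5 and 12; the origin issues
    # its request at t=6, so it idles until t=12.
    print(progress_wait(6.0, [0.0, 5.0, 12.0], transfer_cost=1.0))
```

In this model a request issued at t=6 is not served until t=12, so 6 time units of the origin's completion time are pure waiting rather than data transfer. A trace-analysis tool would have to attribute that interval to missing progress on the target side.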
► Replay-based trace analysis to support passive target synchronization.
► Novel replay schemes for the efficient exchange of performance-relevant information.
► Revealing the significant impact of absent remote progress on the performance of one-sided communication in polling scenarios.
► Strong and weak scaling measurements on up to 32,768 processes with application kernels.
► Performance measurement of NWChem simulation on 4,096 processes.
Journal: Parallel Computing - Volume 39, Issue 3, March 2013, Pages 132–145