Article ID: 488164
Journal: Procedia Computer Science
Published Year: 2011
Pages: 10 Pages
File Type: PDF
Abstract

This paper presents the results of our study on double-precision general matrix-matrix multiplication (DGEMM) for GPU-equipped systems. We further optimized the DGEMM stream kernel previously implemented for a Cypress GPU from AMD. We examined how different memory access patterns affect the performance of the DGEMM kernel by changing its layout function. The experimental results show that the GEMM kernel with the X-Morton layout function outperforms the kernels with any of the other layout functions in terms of both performance and cache hit rate. Moreover, we implemented a DGEMM routine for large matrices whose data cannot all be allocated in GPU memory at once. Our DGEMM performance reaches up to 472 GFlop/s with one GPU and 921 GFlop/s with two GPUs on a system.
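
To illustrate what a layout function is, the following minimal sketch maps a 2D block coordinate to a 1D position in two ways: conventional row-major order and plain Z-order (Z-Morton) bit interleaving. This is not the X-Morton function evaluated in the paper, whose exact definition the abstract does not give; the function names `part1by1`, `z_morton`, and `row_major` are hypothetical and chosen only for this example.

```python
def part1by1(x: int) -> int:
    """Spread the low 16 bits of x so they occupy the even bit positions."""
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def z_morton(row: int, col: int) -> int:
    """Plain Z-order index: row bits go to odd positions, column bits to even.

    Assumption: this is the standard Z-Morton interleaving, shown only as an
    example of a layout function. It is NOT the paper's X-Morton variant.
    """
    return (part1by1(row) << 1) | part1by1(col)


def row_major(row: int, col: int, width: int) -> int:
    """Conventional row-major index, for comparison."""
    return row * width + col


if __name__ == "__main__":
    n = 4
    # Print the position each (row, col) coordinate receives under Z-order.
    for r in range(n):
        print([z_morton(r, c) for c in range(n)])
```

Under Z-order, coordinates that are close in two dimensions tend to receive nearby 1D positions, which is the property that makes Morton-style layouts of interest when studying cache behavior.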
