Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
488164 | Procedia Computer Science | 2011 | 10 Pages |
This paper presents the results of our study on double-precision general matrix-matrix multiplication (DGEMM) for GPU-equipped systems. We applied further optimizations to make use of the DGEMM stream kernel previously implemented for the AMD Cypress GPU. We examined how different memory access patterns affect the performance of the DGEMM kernel by changing its layout function. The experimental results show that the GEMM kernel with the X-Morton layout function is superior to the kernels with any of the other layout functions in terms of both performance and cache hit rate. Moreover, we implemented a DGEMM routine for large matrices whose data cannot all be allocated in GPU memory. Our DGEMM achieves up to 472 GFlop/s on a system using one GPU and up to 921 GFlop/s using two GPUs.
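To illustrate what a layout function does, the following is a minimal CPU-side sketch in Python, not the GPU kernel from the paper. The abstract does not define the X-Morton function, so the sketch uses the plain Z-Morton (bit-interleaving) order, a better-known member of the same family of recursive layouts. The names `interleave_bits`, `morton_tile_order`, and `blocked_matmul`, and the tile size, are hypothetical and chosen only for illustration.

```python
def interleave_bits(row, col, bits=16):
    """Z-Morton index: bit k of col lands at bit 2k, bit k of row at bit 2k+1."""
    index = 0
    for k in range(bits):
        index |= ((col >> k) & 1) << (2 * k)
        index |= ((row >> k) & 1) << (2 * k + 1)
    return index


def morton_tile_order(tiles_per_side):
    """Return all (row, col) tile coordinates sorted by their Z-Morton index."""
    coords = [(r, c) for r in range(tiles_per_side) for c in range(tiles_per_side)]
    return sorted(coords, key=lambda rc: interleave_bits(rc[0], rc[1]))


def blocked_matmul(a, b, n, tile):
    """Compute C = A * B for n x n lists of lists, one output tile at a time.

    Output tiles are visited in Z-Morton order (an assumption made for this
    sketch), so consecutively processed tiles tend to share rows of A and
    columns of B.
    """
    assert n % tile == 0, "this sketch assumes the tile size divides n"
    tiles = n // tile
    c = [[0.0] * n for _ in range(n)]
    for tr, tc in morton_tile_order(tiles):
        for tk in range(tiles):
            # Accumulate the contribution of one tile of A and one tile of B.
            for i in range(tr * tile, (tr + 1) * tile):
                for j in range(tc * tile, (tc + 1) * tile):
                    acc = 0.0
                    for k in range(tk * tile, (tk + 1) * tile):
                        acc += a[i][k] * b[k][j]
                    c[i][j] += acc
    return c
```

Changing the layout function changes only the order in which tiles are visited, not the result. Processing one tile at a time is also the basic idea that lets a routine handle matrices larger than the memory of a single device, although the paper's actual partitioning scheme is not described in this abstract.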