Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems

Article ID	Journal	Published Year	Pages	File Type
488164	Procedia Computer Science	2011	10 Pages	PDF

Abstract

This paper presents results of our study on double-precision general matrix-matrix multiplication (DGEMM) for GPU-equipped systems. We applied further optimization to utilize the DGEMM stream kernel previously implemented for a Cypress GPU from AMD. We have examined the effects of different memory access patterns to the performance of the DGEMM kernel by changing its layout function. The experimental results show that the GEMM kernel with X-Morton layout function superiors to the one with any other functions in terms of performance and cache hit rate. Moreover, we have implemented a DGEMM routine for large matrices, where all data cannot be allocated in a GPU memory. Our DGEMM performance achieves up to 472 GFlop/s and 921 GFlop/s on a system, using one GPU and two GPUs, respectively.