Highly parallel GEMV with register blocking method on GPU architecture

Article ID	Journal	Published Year	Pages	File Type
529785	Journal of Visual Communication and Image Representation	2014	8 Pages	PDF

Abstract

• We propose a register blocking method for GEMV on GPU.
• The proposed method improves parallelism and reuses data on chip at the same time.
• Different block sizes are tested to find the best block size on a GPU platform.

GPUs provide powerful computing ability, especially for data-parallel applications such as video and image processing. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different optimization methods on a GPU often lead to different performance. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the GEMV implementations in CUBLAS and MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. The new method has three advantages. First, instead of using only one thread, we use a warp to compute each element of the vector y, so that the method can exploit the highly parallel GPU architecture. Second, register blocking reduces the required off-chip memory bandwidth. Third, the memory access order is carefully arranged for the threads in each warp so that coalesced memory access is ensured. The proposed optimization methods for GEMV are comprehensively evaluated on different matrix sizes, and the performance of the register blocking method with different block sizes is also evaluated. Experimental results show that the new method achieves very high speedup over CUBLAS or MAGMA for small square matrices and fat matrices, and also achieves higher performance for large square matrices.
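
The three ideas above can be illustrated with a small CPU-side emulation. This is only a sketch under stated assumptions, not the paper's kernel: the abstract does not give the exact blocking scheme, so the code assumes a row-major matrix, a 32-lane warp, and a register block of `rb` rows per warp, in which each loaded element of x is reused for all `rb` rows. The names `gemv_warp_blocked`, `warp_reduce`, and `rb` are hypothetical.

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def warp_reduce(vals):
    """Tree-sum one partial value per lane, mirroring a shuffle-down reduction."""
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        for lane in range(offset):
            vals[lane] += vals[lane + offset]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the full sum


def gemv_warp_blocked(A, x, rb=4):
    """Compute y = A x for a row-major A (list of rows).

    Assumption (not from the paper): one emulated warp handles rb rows.
    Returns (y, x_loads), where x_loads counts the reads of x.
    """
    m, n = len(A), len(x)
    y = [0.0] * m
    x_loads = 0
    for row0 in range(0, m, rb):  # one warp per block of rb rows
        rows = range(row0, min(row0 + rb, m))
        # acc[lane][r] stands in for the per-thread registers
        acc = [[0.0] * len(rows) for _ in range(WARP_SIZE)]
        for j0 in range(0, n, WARP_SIZE):
            for lane in range(WARP_SIZE):
                j = j0 + lane  # adjacent lanes read adjacent columns
                if j >= n:
                    continue
                xj = x[j]  # loaded once, then reused for every row in the block
                x_loads += 1
                for r, row in enumerate(rows):
                    acc[lane][r] += A[row][j] * xj
        for r, row in enumerate(rows):
            y[row] = warp_reduce(acc[lane][r] for lane in range(WARP_SIZE))
    return y, x_loads


if __name__ == "__main__":
    A = [[float((i + j) % 7) for j in range(70)] for i in range(5)]
    x = [float(j % 5) for j in range(70)]
    y, loads = gemv_warp_blocked(A, x, rb=4)
    print(y, loads)
```

In this emulation, the number of reads of x falls as `rb` grows, because each warp covers more rows per loaded element. On a real GPU the block size is limited by the register budget per thread, which is presumably why different block sizes have to be tested.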

Keywords

Memory bandwidth; Data reuse; Many-core; Parallelization; GPU; CUDA