Article code: 424559
Journal code: 685587
Year of publication: 2016
English article: 11-page PDF
Full text: free download
English Title (ISI Article)
Efficient sparse matrix–vector multiplication using cache oblivious extension quadtree storage format
Keywords
Related Subjects
Engineering and Basic Sciences; Computer Engineering; Computational Theory and Mathematics
English Abstract


• A cache oblivious extension quadtree storage structure (COEQT) is proposed.
• Present a conversion algorithm from the CSR storage format to the COEQT storage format (a baseline CSR kernel is sketched after this list).
• Implement the sparse matrix–vector multiplication algorithm based on COEQT.
• Optimize the performance of the implemented algorithm through manual vectorization.
• Implement parallel sparse matrix–vector multiplication on a distributed system.
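
The highlights reference the standard CSR format that COEQT is converted from and compared against. As a point of reference, below is a minimal CSR sparse matrix–vector kernel; it is an illustrative sketch rather than code from the paper, and the struct and field names (csr_matrix, row_ptr, col_idx, values) are assumptions.

```c
/* Hypothetical illustration (not from the paper): the standard CSR
 * sparse matrix-vector product y = A*x that serves as the baseline
 * the COEQT format is converted from and compared against. */
#include <stddef.h>

typedef struct {
    size_t  nrows;     /* number of rows                         */
    size_t *row_ptr;   /* nrows+1 offsets into col_idx / values  */
    size_t *col_idx;   /* column index of each nonzero           */
    double *values;    /* nonzero values, stored row by row      */
} csr_matrix;

/* y = A * x for a matrix stored in CSR format */
void spmv_csr(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->nrows; ++i) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            sum += A->values[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}
```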

In this paper, we elaborate on improving the sparse matrix storage format to optimize the data locality of the sparse matrix–vector multiplication (SpMVM) algorithm and its parallel performance. First, we propose a cache oblivious extension quadtree storage structure (COEQT), in which the sparse matrix is recursively divided into sub-regions that fit well into cache to improve data locality. We then present a COEQT-based SpMVM algorithm and optimize its performance through manual vectorization. With this storage format, the original SpMVM is divided into computations over relatively independent small matrices. In addition, this region-based computation framework is also suitable for high-performance computing in distributed environments, so we finally present a parallel SpMVM algorithm based on the proposed COEQT. Extensive and comprehensive experiments show that sparse matrix–vector multiplication using the COEQT storage format achieves on average a 1.1–1.5× speedup over the CSR format, and even higher performance through instruction-level optimization techniques. The experiment on the Lenovo Deepcomp 7000 demonstrates that this method achieves on average a 1.63× speedup over the Intel Cluster Math Kernel Library implementation.
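
The abstract describes recursively partitioning the matrix into cache-sized sub-regions and multiplying each region independently. The following is a minimal sketch of that cache-oblivious quadtree idea; the node layout (qt_node, child, row, col, val) is an assumption for illustration and does not reproduce the paper's actual COEQT structure.

```c
/* Minimal sketch of the cache-oblivious idea behind a quadtree SpMV
 * (assumed structure; not the paper's actual COEQT layout). Each
 * internal node splits its region into four quadrants; a leaf holds
 * the nonzeros of a block small enough to stay resident in cache. */
#include <stddef.h>

typedef struct qt_node {
    struct qt_node *child[4];  /* NW, NE, SW, SE; all NULL at a leaf   */
    size_t  nnz;               /* number of nonzeros stored at a leaf  */
    size_t *row, *col;         /* global coordinates of those nonzeros */
    double *val;               /* nonzero values                       */
} qt_node;

/* y += A_block * x; y must be zero-initialized by the caller.
 * The traversal visits one quadrant at a time, so each leaf touches
 * only a small, cache-friendly slice of x and y. */
void spmv_quadtree(const qt_node *node, const double *x, double *y)
{
    if (node == NULL)
        return;
    if (node->child[0] || node->child[1] || node->child[2] || node->child[3]) {
        for (int q = 0; q < 4; ++q)          /* internal node: recurse */
            spmv_quadtree(node->child[q], x, y);
        return;
    }
    for (size_t k = 0; k < node->nnz; ++k)   /* leaf: multiply block   */
        y[node->row[k]] += node->val[k] * x[node->col[k]];
}
```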

Publisher
Database: Elsevier - ScienceDirect
Journal: Future Generation Computer Systems - Volume 54, January 2016, Pages 490–500