Article ID | Journal | Published Year | Pages | File Type |
---|---|---|---|---|
486771 | Procedia Computer Science | 2010 | 10 Pages |
Abstract
In this article, we present a fast algorithm for matrix multiplication optimized for recent multicore architectures. The implementation combines several parallel-programming techniques, such as recursive decomposition, efficient low-level implementations of basic blocks, software prefetching, and task scheduling, resulting in a multilevel algorithm with adaptive features. Measurements on different systems and comparisons with GotoBLAS, the Intel Math Kernel Library (IMKL), and the AMD Core Math Library (ACML) show that the presented implementation achieves very high efficiency.
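To make the idea of recursive decomposition concrete, the following is a minimal sketch in Python, not the authors' implementation. The abstract does not specify how the problem is split, so the strategy used here (halve the largest of the three dimensions until a block fits under a threshold, then run a simple triple-loop kernel) is an assumption, as are the names `matmul`, `matmul_recursive`, and the parameter `leaf`. Prefetching, the tuned low-level kernels, and task scheduling are not modeled.

```python
def matmul_recursive(A, B, C, i0, i1, j0, j1, k0, k1, leaf=32):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    Hypothetical helper: matrices are lists of rows, index ranges are half-open.
    """
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= leaf:
        # Base-case "basic block": plain i-p-j loop order, row-wise access.
        for i in range(i0, i1):
            row_a, row_c = A[i], C[i]
            for p in range(k0, k1):
                a = row_a[p]
                row_b = B[p]
                for j in range(j0, j1):
                    row_c[j] += a * row_b[j]
        return
    if m >= n and m >= k:
        # Split the rows of C: the two halves write disjoint parts of C,
        # so they could be run as independent tasks.
        mid = i0 + m // 2
        matmul_recursive(A, B, C, i0, mid, j0, j1, k0, k1, leaf)
        matmul_recursive(A, B, C, mid, i1, j0, j1, k0, k1, leaf)
    elif n >= k:
        # Split the columns of C: also independent.
        mid = j0 + n // 2
        matmul_recursive(A, B, C, i0, i1, j0, mid, k0, k1, leaf)
        matmul_recursive(A, B, C, i0, i1, mid, j1, k0, k1, leaf)
    else:
        # Split the inner dimension: both halves update the same block of C,
        # so they must run one after the other (or use separate buffers).
        mid = k0 + k // 2
        matmul_recursive(A, B, C, i0, i1, j0, j1, k0, mid, leaf)
        matmul_recursive(A, B, C, i0, i1, j0, j1, mid, k1, leaf)


def matmul(A, B, leaf=32):
    """Return the product of two non-empty matrices given as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    matmul_recursive(A, B, C, 0, m, 0, n, 0, k, leaf)
    return C
```

Calling `matmul(A, B, leaf=1)` forces the recursion all the way down to single elements, which is useful for checking the splitting logic; in practice the threshold would be chosen so that a base-case block fits in cache.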
Related Topics
Physical Sciences and Engineering
Computer Science
Computer Science (General)