Exploiting task and data parallelism in ILUPACK’s preconditioned CG solver on NUMA architectures and many-core accelerators

Parallel Computing, 2016 (11 pages)

Abstract

• Specialized implementations of ILUPACK’s iterative solver for NUMA platforms.
• Specialized implementations of ILUPACK’s iterative solver for many-core accelerators.
• Exploitation of task parallelism via the OmpSs runtime (dynamic schedule).
• Exploitation of task parallelism via MPI (static schedule).
• Exploitation of data parallelism for GPUs.

We present specialized implementations of the preconditioned iterative linear system solver in ILUPACK for Non-Uniform Memory Access (NUMA) platforms and for many-core hardware co-processors, namely the Intel Xeon Phi and graphics accelerators. For conventional x86 architectures, our approach exploits task parallelism in two ways: via the OmpSs runtime, which yields a dynamic schedule of the work to the cores, and via a message-passing implementation based on MPI, which yields a static schedule. Both variants have numeric semantics that differ from those of the sequential ILUPACK. For the graphics processor, we exploit data parallelism by off-loading the computationally expensive kernels to the accelerator while preserving the numeric semantics of the sequential case.
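The solver discussed here is a preconditioned Conjugate Gradient (PCG) method. The minimal Python sketch below only illustrates the structure of a PCG iteration; it is not ILUPACK's code. For brevity it assumes a simple Jacobi (diagonal) preconditioner in place of ILUPACK's multilevel incomplete LU preconditioner, and all helper names (`csr_spmv`, `jacobi_diag`, `pcg`, `laplacian_1d`) are hypothetical. With ILU-type preconditioners, the sparse matrix-vector product and the preconditioner application typically dominate the cost of each iteration, which makes them natural candidates for off-loading to an accelerator, while the dot products and vector updates are comparatively cheap.

```python
import math


def csr_spmv(indptr, indices, data, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format."""
    n = len(indptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y


def dot(u, v):
    """Inner product of two dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def jacobi_diag(indptr, indices, data):
    """Extract the diagonal of A (Jacobi preconditioner; a stand-in for ILU).

    Rows without a stored diagonal entry keep 1.0, i.e. no scaling.
    """
    n = len(indptr) - 1
    d = [1.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                d[i] = data[k]
    return d


def pcg(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Preconditioned CG for a symmetric positive definite A.

    Returns (x, iterations). Stops when ||r||_2 <= tol * ||b||_2.
    """
    n = len(b)
    diag = jacobi_diag(indptr, indices, data)
    x = [0.0] * n
    r = list(b)                                 # r = b - A x, with x = 0
    z = [ri / di for ri, di in zip(r, diag)]    # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for it in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        q = csr_spmv(indptr, indices, data, p)  # expensive kernel: SpMV
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [ri / di for ri, di in zip(r, diag)]  # preconditioner application
        rz_new = dot(r, z)
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x, max_iter


def laplacian_1d(n):
    """CSR arrays of the SPD tridiagonal matrix tridiag(-1, 2, -1)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        if i > 0:
            indices.append(i - 1)
            data.append(-1.0)
        indices.append(i)
        data.append(2.0)
        if i < n - 1:
            indices.append(i + 1)
            data.append(-1.0)
        indptr.append(len(indices))
    return indptr, indices, data


if __name__ == "__main__":
    A = laplacian_1d(5)
    b = csr_spmv(*A, [1.0] * 5)   # right-hand side whose exact solution is all ones
    x, iters = pcg(*A, b)
    print(iters, x)
```

In this sketch the loop is sequential; the task-parallel and data-parallel variants described above differ in how the work inside each iteration is distributed, not in the overall PCG recurrence.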

Keywords

Graphics processing units (GPUs); Intel Xeon Phi; Sparse linear systems; Multi-core processors