Article code | Journal code | Publication year | English article | Full-text version
462647 | 696882 | 2015 | 13-page PDF | Free download
English title of the ISI article
Modular vector processor architecture targeting at data-level parallelism
Persian translation of the title
معماری پردازنده بردار مدولار که هدف آن در همگرایی سطح داده است
Keywords
Parallelism, vector processor, performance, speedup, benchmarking
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Networks and Communications
English abstract

Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both performance and energy consumption for such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the options available to designers for meeting their requirements. We present an innovative architecture for a VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In our lane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. The proposed VP, which is developed in VHDL and prototyped on an FPGA, serves as a coprocessor for one or more scalar cores. Benchmarking shows that our VP can achieve very high performance. For example, it achieves a speedup of more than 1500-fold on the color space conversion benchmark compared to running the code on a scalar core. The inclusion of distributed data shuffle engines across vector lanes has a large impact on execution time, primarily for applications like FFT (Fast Fourier Transform) that require large amounts of data shuffling. Compared to running the benchmark on a VP without the shuffle engines, the speedup is 5.92 and 7.33 for the 64-point FFT without and with compiler optimization, respectively. Compared to runs on the scalar core, the achieved speedups for this benchmark are 52.07 and 110.45 without and with compiler optimization, respectively.
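
To give a sense of why an FFT needs so much data shuffling, the sketch below computes the bit-reversal reordering that a standard radix-2 FFT applies to its 64 points, then counts how many elements would have to move between vector lanes. This is an illustration only and not the paper's design: the abstract does not say which shuffle patterns the engines implement, how many lanes the VP has, or how data is distributed. The lane count of 8, the block distribution (element i stored in lane i // 8), and all function names are assumptions made for this example.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_permutation(n: int) -> list[int]:
    """Return perm where output position i takes its value from input perm[i]."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


def count_cross_lane_moves(perm: list[int], lanes: int) -> int:
    """Count elements whose source and destination sit in different lanes.

    Assumes a block distribution (hypothetical, not taken from the paper):
    element i lives in lane i // (len(perm) // lanes).
    """
    per_lane = len(perm) // lanes
    return sum(
        1
        for dst, src in enumerate(perm)
        if dst // per_lane != src // per_lane
    )


if __name__ == "__main__":
    perm = bit_reversal_permutation(64)  # 64-point FFT, as in the benchmark
    moves = count_cross_lane_moves(perm, lanes=8)  # 8 lanes is an assumption
    print(f"{moves} of {len(perm)} elements change lanes")
```

Under these assumptions, most of the 64 elements land in a different lane from the one they started in. Traffic of this kind is what the abstract's shuffle engines are meant to keep off the regular memory-access path.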

Publisher
Database: Elsevier - ScienceDirect
Journal: Microprocessors and Microsystems - Volume 39, Issues 4–5, June–July 2015, Pages 237–249