Article code: 10349868
Journal code: 863719
Publication year: 2013
English article: 13 pages, PDF
Full-text version: Free download
English title of the ISI article
Fast discontinuous Galerkin lattice-Boltzmann simulations on GPUs via maximal kernel fusion
Related subjects
Engineering and Basic Sciences, Chemistry, Theoretical and Practical Chemistry
English abstract
A GPU implementation of the discontinuous Galerkin lattice-Boltzmann method with square spectral elements, highly optimised for speed and precision of calculations, is presented. An extensive analysis of the numerous variants of the fluid solver reveals that the best performance is obtained by maximising CUDA kernel fusion and by arranging the resulting kernel tasks so as to trigger memory-coherent and scattered loads in a specific manner, albeit at the cost of introducing cross-thread load imbalance. Surprisingly, any attempt to eliminate this imbalance, to maximise thread occupancy, or to adopt conventional work tiling or distinct custom kernels highly tuned via ad hoc data and computation layouts invariably deteriorates performance. As such, this work sheds light on the possibility of hiding fetch latencies of workloads involving heterogeneous loads in a way that is more effective than what is achieved with frequently suggested techniques. When simulating the lid-driven cavity on an NVIDIA GeForce GTX 480 via a 5-stage 4th-order Runge-Kutta (RK) scheme, the first four digits of the obtained centreline velocity values, or more, converge to those of the state-of-the-art literature data at a simulation speed of 7.0G primitive variable updates per second during the collision stage and 4.4G during each RK step of the advection, employing double-precision arithmetic (DPA) and a computational grid of only 64² 4×4-point elements. The new programming engine leads to about 2× performance w.r.t. the best programming guidelines in the field. The new fluid solver on the above GPU is also 20-30 times faster than a highly optimised version running on a single core of an Intel Xeon X5650 2.66 GHz.
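The kernel fusion named in the abstract can be illustrated with a minimal CPU-side sketch in C++. It is only an analogy and not the paper's solver: `stage_a` and `stage_b` are hypothetical stand-ins for two per-node stages, and each loop plays the role that a separate CUDA kernel launch would play on the GPU. In the unfused form the intermediate result makes a round trip through memory; in the fused form it stays in a local variable.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-node stages, chosen only for illustration.
// They are NOT the collision or advection operators of the paper.
inline double stage_a(double x) { return 0.5 * x + 1.0; }
inline double stage_b(double x) { return x * x; }

// Unfused: two passes over the data. The intermediate array `tmp`
// is written in full by the first pass and read back by the second,
// much as two separate kernels would communicate via global memory.
std::vector<double> run_unfused(const std::vector<double>& in) {
    std::vector<double> tmp(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        tmp[i] = stage_a(in[i]);
    }
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < tmp.size(); ++i) {
        out[i] = stage_b(tmp[i]);
    }
    return out;
}

// Fused: a single pass. The intermediate value `t` lives in a local
// variable, so no intermediate array is stored or reloaded.
std::vector<double> run_fused(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const double t = stage_a(in[i]);
        out[i] = stage_b(t);
    }
    return out;
}
```

Both functions return the same values; the fused variant simply avoids the extra memory traffic. Whether fusion pays off on a given GPU depends on factors such as register pressure and occupancy, which is the trade-off the abstract reports on.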
Publisher
Database: Elsevier - ScienceDirect
Journal: Computer Physics Communications - Volume 184, Issue 3, March 2013, Pages 537-549