The first exascale supercomputer is expected to be deployed within the next 10 years, but a programming model that enables both easy development and high performance is still unknown. Recent supercomputers deploy manycore accelerators such as GPUs in order to accelerate a wide range of applications. The Asynchronous Partitioned Global Address Space (APGAS) programming model abstracts a deep memory hierarchy, such as distributed memory and GPU device memory, through the combination of a global view of data and asynchronous operations. The APGAS model offers a flexible way for a wide range of applications to express many patterns of concurrency, communication, and control on massively parallel computing environments. It is therefore a candidate programming model for exascale supercomputers, since it can utilize multiple nodes as well as multiple GPUs with high productivity.
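For illustration, the following minimal X10 sketch (not code from our implementation; the class name is arbitrary) shows how the APGAS constructs finish, async, and at express asynchronous work across places, each of which owns a partition of the global address space:

```x10
// Minimal APGAS sketch (illustrative only): one asynchronous activity is
// spawned at every place, and the enclosing finish waits for all of them.
public class ApgasSketch {
    public static def main(args:Rail[String]) {
        finish for (p in Place.places()) {
            at (p) async {
                // Work executed at place p on the locally owned
                // partition of the global address space.
                Console.OUT.println("running at place " + here.id);
            }
        }
    }
}
```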
Although the APGAS model can utilize multiple GPUs, how much GPUs accelerate applications written in the APGAS model remains unclear. While the APGAS model enables highly productive programming for massively parallel computing, its abstraction of the deep memory hierarchy may limit performance, since the abstraction narrows the scope of performance tuning. Moreover, the scalability of the APGAS model across multiple GPUs is also an open problem.
To address these problems, we present a comparative analysis of the APGAS model in X10 against the standard message passing model, using lattice Quantum Chromodynamics (QCD), one of the most challenging applications for supercomputers, as an example. We further analyze the performance of lattice QCD in X10 on multiple GPUs. We first implement a CPU-based lattice QCD code in X10 by fully porting a sequential C implementation to X10. We then extend the X10 implementation to multiple GPUs by implementing CUDA kernels and partitioning the four-dimensional grid into multiple places, where a place denotes a partition of memory corresponding to the host memory or a GPU device memory of a compute node; this allows us to handle the memories of multiple nodes and GPU devices uniformly. We further apply several optimizations, including a data layout optimization for coalesced memory access on GPUs and communication overlap using X10's asynchronous memory copy functions.
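As a simplified illustration of the communication overlap, the following X10 sketch (not our actual code; the class name, haloSize, the sub-lattice size, and computeInterior are hypothetical stand-ins) pushes a boundary region to a neighbouring place with Rail.asyncCopy while the interior computation proceeds, and the enclosing finish guarantees the transfer has completed before dependent work starts:

```x10
// Simplified halo-exchange sketch (illustrative only).
public class HaloOverlapSketch {
    // Hypothetical stand-in for the interior part of the lattice operator.
    static def computeInterior(a:Rail[Double]) {
        for (i in 1..(a.size-2)) a(i) += 1.0;
    }

    public static def main(args:Rail[String]) {
        val haloSize = 4;
        val neighbor = here.next();  // neighbouring place in cyclic order
        // Allocate a halo buffer at the neighbour and keep a global handle to it.
        val remoteHalo = at (neighbor) GlobalRail[Double](new Rail[Double](haloSize));
        val sublattice = new Rail[Double](64, 1.0);  // locally owned sub-lattice
        finish {
            // Asynchronous put of the local boundary into the neighbour's halo
            // buffer; it proceeds concurrently with the interior computation.
            Rail.asyncCopy(sublattice, 0, remoteHalo, 0, haloSize);
            computeInterior(sublattice);
        }
        // The finish has completed: the halo data has arrived at the neighbour.
        Console.OUT.println("halo sent and interior updated at " + here);
    }
}
```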
Our experimental results on TSUBAME2.5 show that our X10 implementation on multiple GPUs outperforms an MPI implementation on multi-core CPUs in both strong and weak scalability on multiple nodes. In the strong scalability evaluation, our X10 implementation on 16 GPUs exhibits a 4.57x speedup over MPI on multi-core CPUs. In the weak scalability evaluation, our X10 implementation on 32 GPUs exhibits an 11.0x speedup over MPI on multi-core CPUs. These results indicate that the APGAS programming model on GPUs scales well and accelerates the lattice QCD application significantly.