Proceedings of the 2017 International Symposium on Code Generation and Optimization, CGO 2017
Austin, TX, USA, February 04--08, 2017
Graphics Processing Units (GPUs) are designed to run a large number of threads in parallel. These threads run on Streaming Multiprocessors (SMs), each of which consists of a few tens of SIMD cores. A kernel is launched on the GPU with an execution configuration, called a grid, that specifies the size of a thread block (TB) and the number of thread blocks. Threads are allocated to and de-allocated from SMs at the granularity of a TB, but are scheduled and executed in groups of 32 consecutive threads, called warps. For various reasons, such as differing amounts of work and memory access latencies, the warps of a TB may finish kernel execution at different points in time, so the faster warps must wait for their slower sibling warps. This waiting reduces the utilization of SM resources and hence the performance of the GPU. We propose a simple and elegant technique to eliminate the waiting time of warps at the end of kernel execution and improve performance. The proposed technique uses persistent threads to define virtual thread blocks and virtual warps, and enables warps that finish earlier to execute the kernel again for another logical (user-specified) thread block without waiting for their sibling warps. We propose simple source-to-source transformations to use virtual thread blocks and virtual warps. Further, this technique enables us to design a warp scheduling algorithm that is aware of the progress made by the virtual thread blocks and virtual warps and uses this knowledge to prioritise warps effectively. Evaluation on a diverse set of kernels from the Rodinia, Parboil and GPGPU-Sim benchmark suites on the GPGPU-Sim simulator showed a geometric mean improvement of 1.06x over the baseline architecture that uses the Greedy Then Old (GTO) warp scheduler and 1.09x over the Loose Round Robin (LRR) warp scheduler.
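The waiting-time argument can be illustrated with a small timing model. The sketch below is not the paper's implementation or its scheduler: the function names, the single-SM setup, and the per-warp durations are all assumptions made for illustration, and the model ignores barriers, shared memory, and warp scheduling policy. It compares a baseline in which a TB slot is released only when its slowest warp finishes against a persistent-warp scheme in which each hardware warp claims the next logical warp as soon as it is free.

```python
import heapq


def baseline_makespan(blocks, tb_slots):
    """Baseline model: a TB slot is held until the slowest warp of its block ends.

    blocks   -- list of logical thread blocks, each a list of per-warp durations
    tb_slots -- number of thread blocks that can be resident at once (assumed)
    """
    free_at = [0] * tb_slots            # time at which each TB slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for warp_times in blocks:
        start = heapq.heappop(free_at)  # earliest-free TB slot takes the block
        end = start + max(warp_times)   # faster warps idle until the slowest ends
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


def persistent_makespan(blocks, tb_slots):
    """Persistent-warp model: a free hardware warp claims the next logical warp."""
    warps_per_tb = len(blocks[0])       # assumes every block has the same warp count
    free_at = [0] * (tb_slots * warps_per_tb)
    heapq.heapify(free_at)
    finish = 0
    for warp_times in blocks:           # logical warps are claimed in launch order
        for duration in warp_times:
            start = heapq.heappop(free_at)
            end = start + duration
            heapq.heappush(free_at, end)
            finish = max(finish, end)
    return finish


# Hypothetical per-warp durations (arbitrary time units), two warps per block.
example_blocks = [[8, 2], [3, 3], [2, 6], [4, 1], [5, 5], [1, 7]]
print(baseline_makespan(example_blocks, tb_slots=2))
print(persistent_makespan(example_blocks, tb_slots=2))
```

For these made-up durations the model finishes in 19 time units under the baseline and 16 with persistent warps. The numbers only illustrate where the idle time comes from; they say nothing about the speedups measured in the evaluation, and the greedy claiming order used here is a simplification rather than the progress-aware scheduler described above.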