PhD Thesis, Supercomputer Education and Research Centre,
Indian Institute of Science, Bangalore, India, November 2017.
There has been a tremendous growth in the use of Graphics Processing Units (GPU) for the acceleration of general purpose applications. The growth is primarily due to the huge computing power offered by the GPUs and the emergence of programming languages such as CUDA and OpenCL. A typical GPU consists of several 100s to a few 1000s of Single Instruction Multiple Data (SIMD) cores, organized as 10s of Streaming Multiprocessors (SMs), each having several SIMD cores which operate in a lock-step manner, offering a few TeraFLOPS of performance in a single socket. SMs execute instructions from a group of consecutive threads, called warps. At each cycle, an SM schedules a warp from a group of active warps and can context switch among the active warps to hide various stalls. However, various factors, such as global memory latency, divergence among warps of a thread block (TB), branch divergence among threads of a warp (Control Divergence), number of active warps, etc., can significantly impact the ability of a warp scheduler to hide stalls. This reduces the speedup of applications running on the GPU. Further, applications containing loops with potential cross iteration dependences, do not utilize the available resources (SIMD cores) effectively and hence suffer in terms of performance. In this thesis, we propose several mechanisms which address the above issues and enhance the performance of GPU applications through efficient warp scheduling, taming branch and warp divergence, and runtime parallelization.
First, we propose RLWS, a Reinforcement Learning (RL) based Warp Scheduler which uses unsupervised learning to schedule warps based on the current state of the core and the long-term benefits of scheduling actions. As the design space involving the state variables used by the RL and the RL parameters (such as learning and exploration rates, reward and penalty values, etc.) is large, we use a Genetic Algorithm to identify the useful subset of state variables and RL parameter values. We evaluated the proposed RL based scheduler using the GPGPU-SIM simulator on a large number of applications from the Rodinia, Parboil, CUDA-SDK and GPGPU-SIM benchmark suites. Our RL based implementation achieved an average speedup of 1.06x over the Loose Round Robin (LRR) strategy and 1.07x over the Two-Level (TL) strategy. A salient feature of RLWS is that it is robust, i.e., performs nearly as well as the best performing warp scheduler, consistently across a wide range of applications. Using the insights obtained from RLWS, we designed PRO, a heuristic warp scheduler which in addition to hiding the long latencies of certain operations, reduces the waiting time of warps at synchronization points. Evaluation of the proposed algorithm using the GPGPU-SIM simulator on a diverse set of applications showed an average speedup of 1.07x over the LRR warp scheduler and 1.08x over the TL warp scheduler.
In the second part of the thesis, we address problems due to warp and branch divergences. First, many GPU kernels exhibit warp divergence due to various reasons such as, different amounts of work, cache misses, and thread divergence. Also, we observed that some kernels contain code which is redundant across TBs, i.e., all TBs will execute the code identically and hence compute the same results. To improve performance of such kernels, we propose a solution based on the concept of virtual TBs and loop independent code motion. We propose necessary code transformations which enable one virtual TB to execute the kernel code for multiple real TBs. We evaluated this technique using the GPGPU-SIM simulator on a diverse set of applications and observed an average improvement of 1.08x over the LRR and 1.04x over the Greedy Then Old (GTO) warp scheduling algorithms. Second, branch divergence causes execution of diverging branches to be serialized to execute only one control flow path at a time. Existing stack based hardware mechanism to reconverge threads causes duplicate execution of code for unstructured control flow graphs (CFG). We propose a simple and elegant transformation to convert an unstructured CFG to a structured CFG. The transformation eliminates duplicate execution of user code while incurring only a linear increase in the number of basic blocks and also the number of instructions. We implemented the proposed transformation at the PTX level using the Ocelot compiler infrastructure and demonstrate that the proposed technique is effective in handling the performance problem due to divergence in unstructured CFGs.
Our third proposal is to enable efficient execution of loops with indirect memory accesses that can potentially cause cross iteration dependences. Such dependences are hard to detect using existing compilation techniques. We present an algorithm to compute at run-time, the cross iteration dependences in such loops, using both the CPU and the GPU. It effectively uses the compute capabilities of the GPU to collect the memory accesses performed by the iterations. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Experimental evaluation on real hardware (NVIDIA GPUs) reveals that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross iteration dependences.