Lab for High Performance Computing SERC, Indian Institute of Science
Home | People | Research | Awards/Honours | Publications | Lab Resources | Gallery | Contact Info | Sponsored Research
Tech. Reports | Conferences / Journals | Theses / Project Reports

PRO: Progress Aware GPU Warp Scheduling Algorithm

IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Hyderabad, India, May 25--29, 2015


  1. Jayvant Anantpur, Supercomputer Education and Research Centre
  2. R. Govindarajan, Supercomputer Education and Research Centre; Department of Computer Science and Automation


Graphics Processing Units (GPUs) contain multiple SIMD cores and each core can run a large number of threads concurrently. Threads in a core are scheduled and executed in fixed sized groups, called warps. Each core contains one or more warp schedulers that select and execute warps from a pool of ready warps. In spite of having a large number of concurrent warps - 48 on NVIDIA Fermi architecture GPU - on many GPGPU applications, current warp scheduling algorithms can not effectively utilize the hardware resources, resulting in stall cycles and loss in performance. The main reason for this is current warp scheduling algorithms mostly focus on long latency operations, especially global memory accesses, and do not take into account factors such as the progress of each thread block and the number of ready warps. In this paper, we propose, PRO, a progress warp scheduling algorithm that not only focuses on finishing individual thread blocks faster but also on reducing the overall execution time. These goals are achieved by dynamically prioritizing thread blocks and warps, based on their progress. We implemented our proposed algorithm in the GPGPU-SIM simulator and evaluated on various applications from GPGPU-SIM, Rodinia and CUDA SDK benchmark suites. We achieved an average speedup of 1.12x and a maximum speedup of 1.94x over the commonly used Loose Round Robin warp scheduling algorithm. Over the Two Level warp scheduler, our algorithm showed an average speedup of 1.13x and a maximum speedup of 1.6x. Our proposed solution requires only a very small increase in the GPU hardware


Full Text