ACM Transactions on Architecture and Code Optimization (TACO)
Volume 14 Issue 4, December 2017 Article No. 51, December 01--03, 2017
Integrated Heterogeneous System (IHS) processors pack throughput-oriented General-Purpose Graphics Processing Units (GPGPUs) alongside latency-oriented Central Processing Units (CPUs) on the same die, sharing certain resources such as the last-level cache, the Network-on-Chip (NoC), and main memory. The demand for memory accesses and other shared resources from GPU cores can exceed that of CPU cores by two to three orders of magnitude. This disparity poses significant problems in exploiting the full potential of these architectures.
In this article, we propose adding a large-capacity stacked DRAM, used as a shared last-level cache, to IHS processors. However, adding the DRAMCache naively leaves significant performance on the table because of the disparate demands from CPU and GPU cores for DRAMCache and memory accesses. In particular, the imbalance can significantly reduce the performance benefits that the CPU cores would otherwise have enjoyed from the DRAMCache, necessitating heterogeneity-aware management of this shared resource.
We therefore propose three simple techniques to enhance the performance of CPU applications while ensuring little to no performance impact on the GPU. Specifically, we propose (i) PrIS, a prioritization scheme for scheduling CPU requests at the DRAMCache controller; (ii) ByE, a selective and temporal bypassing scheme for CPU requests at the DRAMCache; and (iii) Chaining, a mechanism that controls the occupancy of GPU lines in the DRAMCache through pseudo-associativity. The resulting cache, Heterogeneity-Aware Shared DRAMCache (HAShCache), can adapt dynamically to address the inherent disparity of demands in an IHS architecture. Experimental evaluation shows that the proposed HAShCache improves average system performance by 41% over a naive DRAMCache and by over 200% over a baseline system with no stacked DRAMCache.
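To make the idea of prioritizing CPU requests at a shared controller concrete, the following is a minimal sketch, not the PrIS design itself, whose details are not given here. It assumes a single request queue in which CPU requests are always served before GPU requests and arrival order is kept within each class. The names `PriorityRequestQueue`, `CPU`, and `GPU` are hypothetical and chosen only for illustration.

```python
import heapq
from itertools import count

# Assumed encoding for this sketch: a lower value means higher priority.
CPU, GPU = 0, 1


class PriorityRequestQueue:
    """Toy controller queue: CPU requests first, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # arrival counter, used as a FIFO tie-breaker

    def enqueue(self, source, address):
        # Heap entries are ordered by (source class, arrival order).
        heapq.heappush(self._heap, (source, next(self._seq), address))

    def dequeue(self):
        source, _, address = heapq.heappop(self._heap)
        return source, address

    def __len__(self):
        return len(self._heap)


def drain(queue):
    """Serve every pending request and return them in service order."""
    served = []
    while len(queue) > 0:
        served.append(queue.dequeue())
    return served
```

In this sketch a late-arriving CPU request overtakes all queued GPU requests. A strict policy like this could starve GPU requests under heavy CPU load, so a practical scheduler would likely bound how long a GPU request may wait; that bound is not modeled here.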