Chinese Academy of Sciences Institutional Repositories Grid System: Barrier-Aware Warp Scheduling for Throughput Processors

Chinese Academy of Sciences Institutional Repositories Grid

Barrier-Aware Warp Scheduling for Throughput Processors

Document type: Conference paper


Authors	Yuxi Liu; Zhibin Yu; Lieven Eeckhout; Vijay Janapa Reddi; Yingwei Luo; Xiaolin Wang; Zhenlin Wang; Chengzhong Xu
Publication date	2016
Conference name	International Conference on Supercomputing (ICS2016)
Conference location	Istanbul, Turkey
Abstract	Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Little prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications, even though GPGPUs employ lightweight hardware-supported barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that the execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also pinpoint that warp-phase-divergence may result from warp scheduling. To mitigate barrier-induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS improves performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against the recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS' hardware cost is limited to 6 bytes per streaming multiprocessor (SM).
Indexed by	EI
Language	English
Source URL	[http://ir.siat.ac.cn:8080/handle/172644/10317]
Collection	Shenzhen Institutes of Advanced Technology_数字所
Author affiliation	2016
Recommended citation (GB/T 7714)	Yuxi Liu,Zhibin Yu,Lieven Eeckhout,et al. Barrier-Aware Warp Scheduling for Throughput Processors[C]. In: International Conference on Supercomputing (ICS2016). Istanbul, Turkey.
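
The most-waiting-first (MWF) idea summarized in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design: the `Warp` record, the `mwf_pick` function, and the tie-break by lowest warp id are all hypothetical choices made for illustration, and critical-fetch-first (CFF) is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    warp_id: int
    block_id: int
    at_barrier: bool  # True if the warp is stalled at a barrier


def mwf_pick(warps):
    """Pick the next warp to issue under a simplified MWF policy.

    Returns None when every warp is stalled at a barrier.
    """
    # Count, per thread block, how many warps are waiting at a barrier.
    waiting = {}
    for w in warps:
        waiting[w.block_id] = waiting.get(w.block_id, 0) + int(w.at_barrier)

    # Only warps that are not stalled at a barrier can issue.
    ready = [w for w in warps if not w.at_barrier]
    if not ready:
        return None

    # Prefer warps whose block has the most waiting warps.
    # Tie-break by lowest warp id (an assumption, not taken from the abstract).
    return max(ready, key=lambda w: (waiting[w.block_id], -w.warp_id))
```

For example, if block 0 has one of three warps at a barrier and block 1 has two of three at a barrier, the sketch selects the remaining ready warp of block 1, since finishing it releases that block's barrier soonest.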

Ingest method: OAI harvesting

Source: Shenzhen Institutes of Advanced Technology

Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.