Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference
Document type: Journal article
| Authors | Yu, Xiyue1,2,3; Bi, Jun2; Wen, Yuanbo2; Xu, Jianxing1,2,3; Huang, Di2; Guo, Jiaming2; Li, Wei2; Du, Zidong2; Li, Jing1; Chen, Tianshi3 |
| Journal | ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION |
| Publication date | 2025-12-01 |
| Volume | 22; Issue: 4; Pages: 25 |
| Keywords | Code generation; compiler optimization; tensor computation |
| ISSN | 1544-3566 |
| DOI | 10.1145/3762660 |
| Abstract | Optimizing deep learning inference, particularly reducing the execution latency of tensor computations at small batch sizes, is crucial for the successful and widespread adoption of deep neural network (DNN) models. However, current deep learning compilers and hand-tuned libraries often fail to achieve high hardware efficiency when executing small-batch workloads. The primary reason is the inherently sequential nature of reductions (e.g., along the hidden dimension in the flattened GEMM for LLM decoding), which are difficult to parallelize and therefore fail to fully utilize available hardware resources. In this article, we propose Swift, a novel search-based approach for efficiently generating high-performance programs for GPUs by maximizing hardware utilization. The key insight is that reduction parallelization can be incorporated into a unified representation alongside the existing tile structure, significantly expanding the search space for high-performance programs. Concretely, by enumerating all possible parallel mappings of loops, we first generate a large search space that contains high-performance programs. Then, to efficiently explore the extended search space, we employ subspace shifting exploration to identify promising regions, effectively pruning large portions of the less-promising search space. We conduct experiments on three distinct GPU architectures using a diverse set of benchmarks representative of typical application scenarios. Experimental results demonstrate that Swift achieves an average speedup of 1.19x over state-of-the-art compiler-based approaches. Moreover, compared with vendor-provided hand-tuned libraries, Swift achieves an average speedup of 2.40x. |
| Funding | NSF of China[U22A2028] ; NSF of China[62302483] ; NSF of China[62222214] ; NSF of China[62341411] ; NSF of China[6240073476] ; NSF of China[62102398] ; NSF of China[62102399] ; NSF of China[62302478] ; NSF of China[62302482] ; NSF of China[62302480] ; NSF of China[62302481] ; Strategic Priority Research Program of the Chinese Academy of Sciences[XDB0660301] ; Strategic Priority Research Program of the Chinese Academy of Sciences[XDB0660302] ; CAS Project for Young Scientists in Basic Research[YSBR-029] ; Youth Innovation Promotion Association |
| WOS research area | Computer Science |
| Language | English |
| WOS accession number | WOS:001667658800001 |
| Publisher | ASSOC COMPUTING MACHINERY |
| Source URL | http://119.78.100.204/handle/2XEOYT63/42845 |
| Collection | Institute of Computing Technology, Chinese Academy of Sciences |
| Corresponding author | Guo, Qi |
| Author affiliations | 1. Univ Sci & Technol China, Hefei, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing, Peoples R China; 3. Cambricon Technol, Beijing, Peoples R China |
| Recommended citation (GB/T 7714) | Yu, Xiyue,Bi, Jun,Wen, Yuanbo,et al. Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference[J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION,2025,22(4):25. |
| APA | Yu, Xiyue.,Bi, Jun.,Wen, Yuanbo.,Xu, Jianxing.,Huang, Di.,...&Guo, Qi.(2025).Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference.ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION,22(4),25. |
| MLA | Yu, Xiyue,et al."Swift: High Parallelism Program Generation of Tensor Operators for Accelerating Deep Learning Inference".ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION 22.4(2025):25. |
Ingestion method: OAI harvesting
Source: Institute of Computing Technology

