Chinese Academy of Sciences Institutional Repositories Grid
ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads

Document type: Journal article

Authors: Hu, Cunchen (1,2); Huang, Heyang (1,2); Xu, Liangliang (3); Chen, Xusheng (4); Wang, Chenxi (1,2); Xu, Jiang (4); Chen, Shuang (4); Feng, Hao (4); Wang, Sa (1,2); Bao, Yungang (2)
Journal: ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION
Publication date: 2025-06-01
Volume: 22; Issue: 2; Pages: 24
Keywords: LLM serving; disaggregated; interference; schedule
ISSN: 1544-3566
DOI: 10.1145/3732941
Abstract: Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in ShuffleInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that ShuffleInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% less resources all the while lowering average TTFT and average JCT by 97% and 47%, respectively.
Funding: Strategic Priority Research Program of Chinese Academy of Sciences [XDA0320000]; Strategic Priority Research Program of Chinese Academy of Sciences [XDA0320300]; National Natural Science Foundation of China [62090022]; National Natural Science Foundation of China [U24B6012]; National Natural Science Foundation of China [62172388]; China Postdoctoral Science Foundation [2024M762550]; Shaanxi Postdoctoral Research Foundation [2024BSHSDZZ102]
WOS research area: Computer Science
Language: English
WOS accession number: WOS:001533499400010
Publisher: ASSOC COMPUTING MACHINERY
Source URL: http://119.78.100.204/handle/2XEOYT63/42074
Collection: Institute of Computing Technology, Chinese Academy of Sciences, journal articles (English)
Corresponding authors: Wang, Sa; Shan, Yizhou
Author affiliations:
1. Univ Chinese Acad Sci, Beijing, Peoples R China
2. Chinese Acad Sci, State Key Lab Processors, Inst Comp Technol, Beijing, Peoples R China
3. Xidian Univ, Inst Math & Interdisciplinary Sci, Xian, Peoples R China
4. Huawei Cloud, Hangzhou, Peoples R China
Recommended citation:
GB/T 7714: Hu, Cunchen, Huang, Heyang, Xu, Liangliang, et al. ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads[J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2025, 22(2): 24.
APA: Hu, Cunchen, Huang, Heyang, Xu, Liangliang, Chen, Xusheng, Wang, Chenxi, ... & Shan, Yizhou. (2025). ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 22(2), 24.
MLA: Hu, Cunchen, et al. "ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads". ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION 22.2 (2025): 24.

Ingestion method: OAI harvesting

Source: Institute of Computing Technology

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.