Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads

ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads

Document type: Journal article


Authors	Hu, Cunchen 1,2; Huang, Heyang 1,2; Xu, Liangliang 3; Chen, Xusheng 4; Wang, Chenxi 1,2; Xu, Jiang 4; Chen, Shuang 4; Feng, Hao 4; Wang, Sa 1,2; Bao, Yungang 2
Journal	ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION
Publication date	2025-06-01
Volume	22 Issue: 2 Pages: 24
Keywords	LLM serving; disaggregated; interference; schedule
ISSN	1544-3566
DOI	10.1145/3732941
Abstract	Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in ShuffleInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that ShuffleInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin, e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.
Funding	Strategic Priority Research Program of Chinese Academy of Sciences[XDA0320000] ; Strategic Priority Research Program of Chinese Academy of Sciences[XDA0320300] ; National Natural Science Foundation of China[62090022] ; National Natural Science Foundation of China[U24B6012] ; National Natural Science Foundation of China[62172388] ; China Postdoctoral Science Foundation[2024M762550] ; Shaanxi Postdoctoral Research Foundation[2024BSHSDZZ102]
WOS research area	Computer Science
Language	English
WOS accession number	WOS:001533499400010
Publisher	ASSOC COMPUTING MACHINERY
Source URL	[http://119.78.100.204/handle/2XEOYT63/42074]
Collection	Institute of Computing Technology, Chinese Academy of Sciences: Journal Papers (English)
Corresponding authors	Wang, Sa; Shan, Yizhou
Author affiliations	1. Univ Chinese Acad Sci, Beijing, Peoples R China; 2. Chinese Acad Sci, State Key Lab Processors, Inst Comp Technol, Beijing, Peoples R China; 3. Xidian Univ, Inst Math & Interdisciplinary Sci, Xian, Peoples R China; 4. Huawei Cloud, Hangzhou, Peoples R China
Recommended citation (GB/T 7714)	Hu, Cunchen, Huang, Heyang, Xu, Liangliang, et al. ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads[J]. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2025, 22(2): 24.
APA	Hu, Cunchen, Huang, Heyang, Xu, Liangliang, Chen, Xusheng, Wang, Chenxi, ... & Shan, Yizhou. (2025). ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads. ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 22(2), 24.
MLA	Hu, Cunchen, et al. "ShuffleInfer: Disaggregate LLM Inference for Mixed Downstream Workloads". ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION 22.2 (2025): 24.
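
Illustrative sketch (not part of the repository record): the abstract names fixed-size prompt chunking and load-aware decode scheduling as two of ShuffleInfer's pillars. The Python below is a minimal sketch of those two ideas under stated assumptions. The function names `chunk_prompt` and `pick_decode_instance`, the chunk size, and the instance names are hypothetical and do not come from the paper, and the greedy least-predicted-load placement shown here is a simple stand-in, not the paper's two-level scheduling algorithm.

```python
from typing import Dict, List, Sequence


def chunk_prompt(token_ids: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split a prompt's token IDs into fixed-size chunks.

    Every chunk has exactly `chunk_size` tokens except possibly the last,
    which holds the remainder. (Hypothetical helper, for illustration only.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        list(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]


def pick_decode_instance(predicted_load: Dict[str, int], request_cost: int) -> str:
    """Greedy placement: pick the decode instance with the lowest predicted load.

    `predicted_load` maps an instance name to its predicted resource usage
    (an assumed unit, e.g. tokens to be generated). The chosen instance's
    load is increased by `request_cost` so later requests see the update.
    """
    if not predicted_load:
        raise ValueError("no decode instances available")
    target = min(predicted_load, key=predicted_load.get)
    predicted_load[target] += request_cost
    return target


if __name__ == "__main__":
    # A 10-token prompt with an assumed chunk size of 4.
    print(chunk_prompt(list(range(10)), 4))
    # Two hypothetical decode instances with assumed predicted loads.
    load = {"decode-0": 5, "decode-1": 2}
    print(pick_decode_instance(load, 4), load)
```

Chunking keeps each prefill step at a roughly constant size, which is the property the abstract ties to keeping the accelerator near its computation-saturated limit. Routing by predicted load is one simple way to avoid the decode hotspots the abstract mentions.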

Ingest method: OAI harvesting

Source: Institute of Computing Technology
