Chinese Academy of Sciences Institutional Repositories Grid
DeFT: Relaxing data dependencies for efficient communication scheduling in distributed training

Document type: Journal article

Authors: Meng, Lin (1,2,3); Sun, Yuzhong (1,3); Zhu, Jie (3,4)
Journal: FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE
Publication date: 2026-02-01
Volume: 175; Pages: 15
Keywords: Distributed deep learning; Communication scheduling; Data parallelism
ISSN: 0167-739X
DOI: 10.1016/j.future.2025.108103
Abstract: Communication scheduling aims to reduce communication bottlenecks in data parallel training (DP) by maximizing the overlap between computation and communication. However, existing schemes fall short due to three main issues: (1) hard data dependencies prevent some overlap between communication and computation; (2) high coverage rates limit further performance improvement; (3) imbalanced communication/computation times of tensors, caused by partitioning/fusion strategies, introduce more bubbles. Therefore, we propose a new communication scheduling scheme, DeFT, whose key insight is to relax data dependencies and support flexible scheduling in distributed training without reordering bucket communications. DeFT uncovers new overlapping opportunities in training by transforming the scheduling problem into multiple knapsack problems. Specifically, DeFT eliminates hard dependencies with delayed updates, reduces the coverage rate by adjusting the update frequency and utilizing heterogeneous communication links, and uses the merged computation times of the backward or forward pass as the knapsack capacity to avoid the negative impact of unbalanced tensors. Additionally, DeFT preserves training accuracy by adjusting its scheduling strategy via convergence loss quantification. Extensive experiments with 16 A100 GPUs showed that DeFT achieved speedups of 29% to 115% on three representative benchmarks compared to US-Byte and Bytescheduler with no loss of accuracy.
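The knapsack framing in the abstract can be illustrated with a minimal sketch. This is not DeFT's actual algorithm: it assumes a single knapsack whose capacity is the merged computation time of one pass, treats each bucket's communication time as both its weight and its value, and uses integer time units. The function name `pick_overlappable_buckets` and the sample numbers are hypothetical.

```python
from typing import List, Tuple


def pick_overlappable_buckets(comm_time: List[int], capacity: int) -> Tuple[int, List[int]]:
    """0/1 knapsack sketch (illustrative assumption, not the paper's method).

    comm_time[i] is the communication time of bucket i (integer units).
    capacity is the computation time available to hide communication.
    Returns the largest total communication time that fits within the
    capacity, and the indices of the buckets that achieve it.
    """
    n = len(comm_time)
    # best[i][c]: max hidden communication time using the first i buckets
    # when c units of computation time are available.
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        t = comm_time[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip bucket i-1
            if t <= c:  # or overlap bucket i-1 if it fits
                best[i][c] = max(best[i][c], best[i - 1][c - t] + t)

    # Walk the table backwards to recover which buckets were chosen.
    chosen: List[int] = []
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= comm_time[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    # Hypothetical bucket communication times and computation capacity.
    total, buckets = pick_overlappable_buckets([4, 3, 5, 2], capacity=9)
    print(total, buckets)
```

Buckets left out of the selection would be the ones whose communication is deferred to a later knapsack, which is where the abstract's delayed updates and multiple knapsacks come in; that part is not modeled here.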
Funding: Science and Technology Innovation 2030-Major Project [2022ZD0119104]
WOS research area: Computer Science
Language: English
WOS accession number: WOS:001565585500003
Publisher: ELSEVIER
Source URL: http://119.78.100.204/handle/2XEOYT63/41722
Collection: Institute of Computing Technology, Chinese Academy of Sciences, journal articles (English)
Corresponding author: Sun, Yuzhong
Author affiliations:
1. Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
2. Univ Chinese Acad Sci, Beijing 101408, Peoples R China
3. Chinese Acad Sci, Inst Comp Technol, State Key Lab Chinese Comp Architecture, Beijing 100864, Peoples R China
4. Nanjing Univ Posts & Telecommun, Sch Comp Sci, Nanjing 210023, Peoples R China
Recommended citation:
GB/T 7714: Meng, Lin, Sun, Yuzhong, Zhu, Jie. DeFT: Relaxing data dependencies for efficient communication scheduling in distributed training[J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2026, 175: 15.
APA: Meng, Lin, Sun, Yuzhong, & Zhu, Jie. (2026). DeFT: Relaxing data dependencies for efficient communication scheduling in distributed training. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 175, 15.
MLA: Meng, Lin, et al. "DeFT: Relaxing data dependencies for efficient communication scheduling in distributed training". FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE 175 (2026): 15.

Ingest method: OAI harvesting

Source: Institute of Computing Technology

