Chinese Academy of Sciences Institutional Repositories Grid
Low-Latency PIM Accelerator for Edge LLM Inference

Document type: Journal article

Authors: Wang, Xinyu [1,2]; Sun, Xiaotian [1,2,3]; Li, Wanqian [1,2]; Min, Feng [3]; Zhang, Xiaoyu [3]; Zhang, Xinjiang [3]; Han, Yinhe [3]; Chen, Xiaoming [3]
Journal: IEEE COMPUTER ARCHITECTURE LETTERS
Publication date: 2025-07-01
Volume: 24; Issue: 2; Pages: 321-324
Keywords: Random access memory; Low latency communication; Engines; Bandwidth; Vectors; Registers; Quantization (signal); Energy efficiency; Hardware; Computational modeling; Large language model inference; processing-in-memory; edge accelerator
ISSN: 1556-6056
DOI: 10.1109/LCA.2025.3618104
Abstract: Deploying large language models (LLMs) on edge devices has the potential for low-latency inference and privacy protection. However, meeting the substantial bandwidth demands of latency-oriented edge devices is challenging because of their strict power constraints. Resistive random-access memory (RRAM)-based processing-in-memory (PIM) is an ideal solution for this challenge, thanks to its low read power and high internal bandwidth. Moreover, applying quantization methods, which require different precisions for weights and activations, is a common practice in edge inference. However, existing accelerators cannot fully leverage the benefits of quantization, as they lack multiply-accumulate (MAC) units optimized for mixed-precision operands. To achieve low-latency edge inference, we design an RRAM-based PIM die that integrates dedicated energy-efficient MAC units, providing both computation and storage capabilities. Coupled with a dynamic random-access memory (DRAM) die for storing the key-value (KV) cache, we propose Lyla, an accelerator for low-latency edge LLM inference. Experimental results show that Lyla achieves 3.8x, 2.4x, and 1.2x latency improvements over a GPU and two DRAM-based PIM accelerators, respectively.
Funding: National Natural Science Foundation of China [62488101]; National Natural Science Foundation of China [62495104]; National Natural Science Foundation of China [62025404]; Youth Innovation Promotion Association CAS
WOS research area: Computer Science
Language: English
WOS accession number: WOS:001600730100005
Publisher: IEEE COMPUTER SOC
Source URL: [http://119.78.100.204/handle/2XEOYT63/41586]
Collection: Institute of Computing Technology, Chinese Academy of Sciences, journal articles (English)
Corresponding author: Chen, Xiaoming
Author affiliations:
1. Chinese Acad Sci, Univ Chinese Acad Sci, State Key Lab Processors, Beijing 101408, Peoples R China
2. Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China
3. Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100190, Peoples R China
Recommended citation formats
GB/T 7714
Wang, Xinyu,Sun, Xiaotian,Li, Wanqian,et al. Low-Latency PIM Accelerator for Edge LLM Inference[J]. IEEE COMPUTER ARCHITECTURE LETTERS,2025,24(2):321-324.
APA Wang, Xinyu.,Sun, Xiaotian.,Li, Wanqian.,Min, Feng.,Zhang, Xiaoyu.,...&Chen, Xiaoming.(2025).Low-Latency PIM Accelerator for Edge LLM Inference.IEEE COMPUTER ARCHITECTURE LETTERS,24(2),321-324.
MLA Wang, Xinyu,et al."Low-Latency PIM Accelerator for Edge LLM Inference".IEEE COMPUTER ARCHITECTURE LETTERS 24.2(2025):321-324.

Ingest method: OAI harvesting

Source: Institute of Computing Technology

