Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): Low-Latency PIM Accelerator for Edge LLM Inference

Low-Latency PIM Accelerator for Edge LLM Inference

Document type: Journal article


Authors	Wang, Xinyu 1,2; Sun, Xiaotian 1,2,3; Li, Wanqian 1,2; Min, Feng 3; Zhang, Xiaoyu 3; Zhang, Xinjiang 3; Han, Yinhe 3; Chen, Xiaoming 3
Journal	IEEE COMPUTER ARCHITECTURE LETTERS
Publication date	2025-07-01
Volume	24 Issue: 2 Pages: 321-324
Keywords	Random access memory; Low latency communication; Engines; Bandwidth; Vectors; Registers; Quantization (signal); Energy efficiency; Hardware; Computational modeling; Large language model inference; processing-in-memory; edge accelerator
ISSN	1556-6056
DOI	10.1109/LCA.2025.3618104
Abstract	Deploying large language models (LLMs) on edge devices has the potential to deliver low-latency inference and privacy protection. However, meeting the substantial bandwidth demands of latency-oriented edge devices is challenging because of their strict power constraints. Resistive random-access memory (RRAM)-based processing-in-memory (PIM) is well suited to this challenge, thanks to its low read power and high internal bandwidth. Moreover, applying quantization methods, which require different precisions for weights and activations, is common practice in edge inference. However, existing accelerators cannot fully leverage the benefits of quantization, as they lack multiply-accumulate (MAC) units optimized for mixed-precision operands. To achieve low-latency edge inference, we design an RRAM-based PIM die that integrates dedicated energy-efficient MAC units, providing both computation and storage capabilities. Coupling it with a dynamic random-access memory (DRAM) die that stores the key-value (KV) cache, we propose Lyla, an accelerator for low-latency edge LLM inference. Experimental results show that Lyla achieves 3.8x, 2.4x, and 1.2x latency improvements over a GPU and two DRAM-based PIM accelerators, respectively.
Funding	National Natural Science Foundation of China[62488101] ; National Natural Science Foundation of China[62495104] ; National Natural Science Foundation of China[62025404] ; Youth Innovation Promotion Association CAS
WOS research area	Computer Science
Language	English
WOS accession number	WOS:001600730100005
Publisher	IEEE COMPUTER SOC
Source URL	[http://119.78.100.204/handle/2XEOYT63/41586]
Collection	Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English)
Corresponding author	Chen, Xiaoming
Author affiliations	1.Chinese Acad Sci, Univ Chinese Acad Sci, State Key Lab Processors, Beijing 101408, Peoples R China 2.Chinese Acad Sci, Inst Comp Technol, Beijing 100190, Peoples R China 3.Chinese Acad Sci, Inst Comp Technol, State Key Lab Processors, Beijing 100190, Peoples R China
Recommended citation: GB/T 7714	Wang, Xinyu,Sun, Xiaotian,Li, Wanqian,et al. Low-Latency PIM Accelerator for Edge LLM Inference[J]. IEEE COMPUTER ARCHITECTURE LETTERS,2025,24(2):321-324.
APA	Wang, Xinyu.,Sun, Xiaotian.,Li, Wanqian.,Min, Feng.,Zhang, Xiaoyu.,...&Chen, Xiaoming.(2025).Low-Latency PIM Accelerator for Edge LLM Inference.IEEE COMPUTER ARCHITECTURE LETTERS,24(2),321-324.
MLA	Wang, Xinyu,et al."Low-Latency PIM Accelerator for Edge LLM Inference".IEEE COMPUTER ARCHITECTURE LETTERS 24.2(2025):321-324.

Ingest method: OAI harvesting

Source: Institute of Computing Technology
