Chinese Academy of Sciences Institutional Repositories Grid

Token-level Direct Preference Optimization

Document type: Conference paper


Authors	Zeng, Yongcheng 1; Liu, Guoqing 3; Ma, Weiyu 1; Yang, Ning 1; Zhang, Haifeng 1; Wang, Jun 2
Publication date	2024
Conference dates	2024/7/21-27
Conference location	Vienna, Austria
Abstract	Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs at the token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO’s superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
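Language	English
Source URL	[http://ir.ia.ac.cn/handle/173211/57249]
Collection	Laboratory of Cognition and Decision Intelligence for Complex Systems_Collective Decision Intelligence Team (复杂系统认知与决策实验室_群体决策智能团队)
Corresponding authors	Zhang, Haifeng; Wang, Jun
Author affiliations	1. Institute of Automation, Chinese Academy of Sciences; 2. University College London; 3. Microsoft Research AI4Science
Recommended citation (GB/T 7714)	Zeng, Yongcheng, Liu, Guoqing, Ma, Weiyu, et al. Token-level Direct Preference Optimization[C]. In:. Vienna, Austria. 2024/7/21-27.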

Ingestion method: OAI harvesting

Source: Institute of Automation
