Conservative Policy Gradient in Multi-critic Setting
Document Type | Conference Paper
Authors | Xi, Bao (2,3); Wang, Rui; Wang, Shuo; et al.
Publication Date | 2019-11
Conference Date | 2019.11.22-24
Conference Venue | Hangzhou, China
Keywords | inconsistency; stability; Q-learning; policy gradient
Abstract | The Twin Delayed Deep Deterministic policy gradient algorithm (TD3) addresses the overestimation bias problem by adopting a clipped Double Q-Learning method. Because the two Q networks differ, updating the policy to maximize one Q function may decrease the other, which this paper calls inconsistency. We therefore propose an algorithm based on TD3, conservative policy gradient (CPG), which optimizes the policy with respect to the lower bound of the two Q functions to address this inconsistency. In Q-function learning, a one-step estimate is usually used for the target value. However, because the target networks change constantly, the estimate fluctuates. Since the target Q function changes slowly, we combine the one-step estimate with a zero-step estimate to avoid sharp changes. Experimental results show that CPG outperforms TD3 and several other reinforcement learning methods on multiple MuJoCo benchmarks.
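The abstract describes two ingredients: updating the policy against the lower bound (minimum) of the two critics, and blending the usual one-step target with a zero-step estimate from the target network to damp fluctuations. Below is a minimal, illustrative PyTorch-style sketch of those two computations, assuming a TD3-style setup with two critics; the function names, critic call signatures, and the blending weight `blend_beta` are hypothetical placeholders, not taken from the paper.

```python
# Illustrative sketch (not the authors' code) of the two ideas summarized
# in the abstract, under a generic TD3-style interface.
import torch


def conservative_actor_loss(q1, q2, actor, states):
    """Update the policy against the lower bound of the two Q estimates,
    so that increasing one critic's value cannot silently decrease the other's."""
    actions = actor(states)
    q_lower = torch.min(q1(states, actions), q2(states, actions))
    return -q_lower.mean()


def blended_target(reward, not_done, gamma, q_target_next, q_target_current,
                   blend_beta=0.5):
    """Target value that mixes the one-step estimate with a 'zero-step'
    estimate (the target network's value at the current state-action pair)
    to avoid sharp changes while the target networks move.
    `blend_beta` is an assumed mixing weight, not a value from the paper."""
    one_step = reward + not_done * gamma * q_target_next
    zero_step = q_target_current
    return blend_beta * one_step + (1.0 - blend_beta) * zero_step
```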
Language | English
Source URL | http://ir.ia.ac.cn/handle/173211/42219
Collection | Intelligent Robot Systems Research
Corresponding Author | Wang, Shuo
Affiliations | 1. Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China; 2. State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Beijing, China; 3. University of Chinese Academy of Sciences, Beijing, China
Recommended Citation (GB/T 7714) | Xi, Bao, Wang, Rui, Wang, Shuo, et al. Conservative Policy Gradient in Multi-critic Setting[C]. In: . Hangzhou, China. 2019.11.22-24.
Deposit Method: OAI Harvest
Source: Institute of Automation