Chinese Academy of Sciences Institutional Repositories Grid
Conservative Policy Gradient in Multi-critic Setting

Document Type: Conference Paper

Authors: Xi, Bao 2,3; Wang, Rui 2; Wang, Shuo 1,2,3; Lu, Tao 2; Cai, Yinghao 2
Publication Date: 2019-11
Conference Date: 2019.11.22-24
Conference Venue: Hangzhou, China
Keywords: inconsistency; stability; Q-learning; policy gradient
Abstract (English)

The Twin Delayed Deep Deterministic policy gradient algorithm (TD3) addresses the overestimation bias problem by adopting clipped Double Q-learning. Because the two Q networks differ, updating the policy to maximize one Q function may decrease the other, a problem we call inconsistency in this paper. We therefore propose an algorithm based on TD3, conservative policy gradient (CPG), which optimizes the policy with respect to the lower bound of the two Q functions to address this inconsistency. In Q-function learning, a one-step estimate is usually used for the target value. However, because the target networks are constantly changing, this estimate fluctuates. Since the target Q function changes slowly, we combine the one-step estimate with a zero-step estimate to avoid sharp changes. The experimental results show that CPG outperforms TD3 and several other reinforcement learning methods on multiple MuJoCo benchmarks.
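The following is a minimal PyTorch-style sketch, not the authors' code, of the two ideas the abstract describes: updating the actor against the lower bound (elementwise minimum) of the two critics, and blending a one-step bootstrapped target with a zero-step estimate to damp fluctuations. The exact form of the zero-step estimate, the mixing weight `alpha`, the network interfaces, and all hyperparameters are illustrative assumptions.

```python
# Sketch under stated assumptions; network modules and hyperparameters are hypothetical.
import torch


def critic_target(reward, not_done, state, action, next_state,
                  actor_target, critic1_target, critic2_target,
                  gamma=0.99, alpha=0.5):
    """Blend a one-step bootstrapped target with a zero-step estimate (assumed form)."""
    with torch.no_grad():
        next_action = actor_target(next_state)
        # Clipped Double Q-learning: bootstrap from the smaller of the two target critics.
        one_step = reward + not_done * gamma * torch.min(
            critic1_target(next_state, next_action),
            critic2_target(next_state, next_action))
        # Zero-step estimate: the slowly changing target critics evaluated at (s, a)
        # directly, with no reward bootstrap (my reading of the abstract).
        zero_step = torch.min(critic1_target(state, action),
                              critic2_target(state, action))
        # Convex combination damps sharp changes caused by the moving targets.
        return alpha * one_step + (1.0 - alpha) * zero_step


def actor_loss(state, actor, critic1, critic2):
    """Conservative policy gradient: maximize the lower bound of the two critics."""
    action = actor(state)
    return -torch.min(critic1(state, action), critic2(state, action)).mean()
```

Taking the minimum over both critics in the actor loss is what makes the update conservative: improving the lower bound cannot maximize one Q function at the expense of the other, which is the inconsistency the abstract describes.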

Language: English
Source URL: http://ir.ia.ac.cn/handle/173211/42219
Collection: Intelligent Robotic Systems Research
Corresponding Author: Wang, Shuo
Author Affiliations:
1. Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China
2. State Key Laboratory of Management and Control for Complex System, Institute of Automation, Beijing, China
3. University of Chinese Academy of Sciences, Beijing, China
Recommended Citation (GB/T 7714):
Xi, Bao, Wang, Rui, Wang, Shuo, et al. Conservative Policy Gradient in Multi-critic Setting[C]. In: . Hangzhou, China. 2019.11.22-24.

Deposit Method: OAI Harvesting

Source: Institute of Automation


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.