Conservative Policy Gradient in Multi-critic Setting
Document Type | Conference Paper
Authors | Xi, Bao (2,3); Wang, Rui; Wang, Shuo; et al.
Publication Date | 2019-11
Conference Date | 2019.11.22-24
Conference Venue | Hangzhou, China
Keywords | inconsistency; stability; Q-learning; policy gradient
Abstract | The Twin Delayed Deep Deterministic policy gradient algorithm (TD3) addresses the overestimation bias problem by adopting a clipped Double Q-Learning method. Because the two Q networks differ, updating the policy to maximize one Q function may decrease the other, which this paper calls inconsistency. We therefore propose an algorithm based on TD3, conservative policy gradient (CPG), which optimizes the policy with respect to the lower bound of the two Q functions to address this inconsistency. In Q-function learning, a one-step estimate is usually used for the target value. However, because the target networks change constantly, the estimate fluctuates. Since the target Q function changes slowly, we combine the one-step estimate with a zero-step estimate to avoid sharp changes. Experimental results show that CPG outperforms TD3 and several other reinforcement learning methods on multiple MuJoCo benchmarks.
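The abstract describes two ingredients: updating the policy against the lower bound (minimum) of the two critics, and blending the usual one-step target with a zero-step estimate from the target network to damp fluctuations. Below is a minimal, illustrative PyTorch-style sketch of those two computations, assuming a TD3-style setup with two critics; the function names, critic call signatures, and the blending weight `blend_beta` are hypothetical placeholders, not taken from the paper.

```python
# Illustrative sketch (not the authors' code) of the two ideas summarized
# in the abstract, under a generic TD3-style interface.
import torch


def conservative_actor_loss(q1, q2, actor, states):
    """Update the policy against the lower bound of the two Q estimates,
    so that increasing one critic's value cannot silently decrease the other's."""
    actions = actor(states)
    q_lower = torch.min(q1(states, actions), q2(states, actions))
    return -q_lower.mean()


def blended_target(reward, not_done, gamma, q_target_next, q_target_current,
                   blend_beta=0.5):
    """Target value that mixes the one-step estimate with a 'zero-step'
    estimate (the target network's value at the current state-action pair)
    to avoid sharp changes while the target networks move.
    `blend_beta` is an assumed mixing weight, not a value from the paper."""
    one_step = reward + not_done * gamma * q_target_next
    zero_step = q_target_current
    return blend_beta * one_step + (1.0 - blend_beta) * zero_step
```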
Language | English
Source URL | http://ir.ia.ac.cn/handle/173211/42219
Collection | Intelligent Robot Systems Research
Corresponding Author | Wang, Shuo
Affiliations | 1. Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai, China; 2. State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Beijing, China; 3. University of Chinese Academy of Sciences, Beijing, China
Recommended Citation (GB/T 7714) | Xi, Bao, Wang, Rui, Wang, Shuo, et al. Conservative Policy Gradient in Multi-critic Setting[C]. In: . Hangzhou, China. 2019.11.22-24.
Deposit Method: OAI Harvest
Source: Institute of Automation