BEVBert: Multimodal Map Pre-training for Language-guided Navigation
Document Type: Conference Paper
Author | Dong An |
Publication Date | 2023-10 |
Conference Date | 2023-10-2 |
Conference Location | Paris, France |
Abstract | Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent's spatial understanding. Thus, we propose a new map-based pre-training paradigm that is spatial-aware for use in VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design balances the demand of VLN for both short-term reasoning and long-term planning. Then, based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks. |
Proceedings | Proceedings of the IEEE International Conference on Computer Vision |
Language | English |
Source URL | [http://ir.ia.ac.cn/handle/173211/56611] |
Category | Institute of Automation, Center for Research on Intelligent Perception and Computing |
Author Affiliations | 1. Institute of Automation, Chinese Academy of Sciences 2. Australian Institute for Machine Learning, University of Adelaide 3. Shanghai AI Laboratory 4. Nanjing University 5. SenseTime Research 6. School of Future Technology, UCAS |
Recommended Citation (GB/T 7714) | Dong An, Yuankai Qi, Yangguang Li, et al. BEVBert: Multimodal Map Pre-training for Language-guided Navigation[C]. In: Proceedings of the IEEE International Conference on Computer Vision. Paris, France, 2023-10-2. |
Deposit Method: OAI harvesting
Source: Institute of Automation
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.