Chinese Academy of Sciences Institutional Repositories Grid
Weakly-Supervised Video Object Grounding Via Learning Uni-Modal Associations

Document Type: Journal Article

Authors: Wang, Wei (1,2); Gao, Junyu (1,2); Xu, Changsheng (1,2,3)
Journal: IEEE Transactions on Multimedia
Publication Date: 2022
Pages: 1-12
Abstract

Grounding objects described in natural language to visual regions in the video is a crucial capability needed in vision-and-language fields. In this paper, we deal with the weakly-supervised video object grounding (WSVOG) task, where only video-sentence pairs are provided for learning. The essence of this task is to learn the cross-modal associations between words in textual modality and regions in visual modality. Despite the recent progress, we find that most existing methods focus on the association learning for cross-modal samples, while the rich and complementary information within uni-modal samples has not been fully exploited. To this end, we propose to explicitly learn uni-modal associations on both textual and visual sides, so as to fully exploit the useful uni-modal information for accurate video object grounding. Specifically, (1) we learn textual prototypes by considering rich contextual information of the same object in different sentences, and (2) we estimate visual prototypes in an adaptive manner so as to overcome the uncertainties in selecting object-relevant visual regions. Besides, a cross-modal correspondence is learned which not only bridges the visual and textual modalities for WSVOG task, but also tightly cooperates with the uni-modal association learning process. We conduct extensive experiments on three popular datasets, and the favorable results demonstrate the effectiveness of our method.
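The abstract's core idea of learning prototypes on each modality and a cross-modal correspondence can be illustrated with a minimal sketch. This is not the authors' actual model; the function names, the mean-pooled textual prototype, the softmax-attention visual prototype, and the cosine-similarity grounding step are all simplified assumptions for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length for cosine-similarity comparisons."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def textual_prototype(word_embeddings):
    """Aggregate contextual embeddings of the same object word across
    different sentences into a single textual prototype (simple mean here)."""
    return l2_normalize(np.mean(word_embeddings, axis=0))

def visual_prototype(region_features, text_proto):
    """Adaptively estimate a visual prototype: weight candidate regions by
    their similarity to the textual prototype (softmax attention), which is
    one simple way to handle uncertainty in selecting object-relevant regions."""
    regions = l2_normalize(region_features)
    scores = regions @ text_proto
    weights = np.exp(scores) / np.sum(np.exp(scores))
    return l2_normalize(weights @ region_features)

def ground(region_features, text_proto):
    """Cross-modal correspondence: return the index of the region whose
    feature is most similar to the textual prototype."""
    regions = l2_normalize(region_features)
    return int(np.argmax(regions @ text_proto))
```

In a trained system the embeddings would come from learned text and video encoders, and the prototypes and correspondence would be optimized jointly from video-sentence pairs; this sketch only shows the inference-time shape of the idea.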

Source URL: http://ir.ia.ac.cn/handle/173211/51522
Research Group: State Key Laboratory of Multimodal Artificial Intelligence Systems
Author Affiliations:
1.School of Artificial Intelligence, University of Chinese Academy of Sciences
2.National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
3.PengCheng Laboratory
Recommended Citation:
GB/T 7714
Wang, Wei, Gao, Junyu, Xu, Changsheng. Weakly-Supervised Video Object Grounding Via Learning Uni-Modal Associations[J]. IEEE Transactions on Multimedia, 2022: 1-12.
APA: Wang, Wei, Gao, Junyu, & Xu, Changsheng. (2022). Weakly-Supervised Video Object Grounding Via Learning Uni-Modal Associations. IEEE Transactions on Multimedia, 1-12.
MLA: Wang, Wei, et al. "Weakly-Supervised Video Object Grounding Via Learning Uni-Modal Associations." IEEE Transactions on Multimedia (2022): 1-12.

Deposit Method: OAI Harvesting

Source: Institute of Automation


Unless otherwise noted, all content in this system is protected by copyright, with all rights reserved.