Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension
Document Type | Journal Article |
Authors | Zhang, Yujia1; Li, Qianzhong; Pan, Yi; Zhao, Xiaoguang; Tan, Min |
Journal | IEEE TRANSACTIONS ON IMAGE PROCESSING |
Publication Date | 2024 |
Volume | 33 |
Pages | 3256-3270 |
Keywords | Feature extraction; Visualization; Task analysis; Representation learning; Location awareness; Linguistics; Grounding; Video-based referring expression comprehension; multi-stage learning; image-language cross-generative fusion; consistency loss |
ISSN | 1057-7149 |
DOI | 10.1109/TIP.2024.3394260 |
Corresponding Author | Zhang, Yujia (zhangyujia2014@ia.ac.cn) |
English Abstract | Video-based referring expression comprehension is a challenging task that requires locating the referred object in each frame of a given video. While many existing approaches treat this task as an object-tracking problem, their performance is heavily reliant on the quality of the tracking templates. Furthermore, when there is not enough annotation data to assist in template selection, the tracking may fail. Other approaches are based on object detection, but they often use only one frame adjacent to the key frame for feature learning, which limits their ability to establish the relationship between different frames. In addition, improving the fusion of features from multiple frames and referring expressions to effectively locate the referents remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is based on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning of adjacent time sequences. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning to generate cross-modal features by calculating the similarity between video and expression, and then refining and fusing the generated features. To further enhance the cross-modal feature generation capability of our model, we introduce a consistency loss that constrains the image-language similarity and language-image similarity matrices during feature generation. We evaluate our proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results. |
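The record does not give the exact form of the consistency loss described in the abstract. Below is a minimal sketch of one plausible reading of the idea: the image-to-language similarity matrix, computed with its own learned projections, is constrained to agree with the transpose of the language-to-image one. All function names, projection matrices, and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(queries: np.ndarray, keys: np.ndarray,
               w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Scaled dot-product similarity with direction-specific projections."""
    q = queries @ w_q                        # (N, d)
    k = keys @ w_k                           # (M, d)
    return (q @ k.T) / np.sqrt(q.shape[-1])  # (N, M)

def consistency_loss(img: np.ndarray, txt: np.ndarray,
                     w_qi, w_kl, w_ql, w_ki) -> float:
    """Hypothetical consistency loss: penalize disagreement between the
    image->language similarity matrix and the transpose of the
    language->image one. Because each direction uses its own projections,
    the two matrices are not trivially equal."""
    s_il = similarity(img, txt, w_qi, w_kl)  # image queries, language keys
    s_li = similarity(txt, img, w_ql, w_ki)  # language queries, image keys
    return float(np.mean((s_il - s_li.T) ** 2))

# Toy usage: 8 frame features and 12 expression-token features, 256-d each.
dim = 256
img = rng.normal(size=(8, dim))
txt = rng.normal(size=(12, dim))
w_qi, w_kl, w_ql, w_ki = (0.05 * rng.normal(size=(dim, dim)) for _ in range(4))
print(consistency_loss(img, txt, w_qi, w_kl, w_ql, w_ki))
```

Minimizing such a term pushes the two cross-modal similarity estimates toward mutual agreement, which is one natural way to read the abstract's constraint on the image-language and language-image similarity matrices.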
WOS Keywords | TRACKING |
Funding Project | National Natural Science Foundation of China |
WOS Research Areas | Computer Science; Engineering |
Language | English |
WOS Accession Number | WOS:001216329500001 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
Funding Organization | National Natural Science Foundation of China |
Source URL | http://ir.ia.ac.cn/handle/173211/58321 |
Collection | Intelligent Robotic Systems Research |
Author Affiliations |
1. Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing 100190, Peoples R China
2. Meituan, Beijing 100015, Peoples R China
3. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100190, Peoples R China
4. Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing 100190, Peoples R China |
Recommended Citation (GB/T 7714) | Zhang, Yujia, Li, Qianzhong, Pan, Yi, et al. Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 3256-3270. |
APA | Zhang, Yujia, Li, Qianzhong, Pan, Yi, Zhao, Xiaoguang, & Tan, Min. (2024). Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension. IEEE TRANSACTIONS ON IMAGE PROCESSING, 33, 3256-3270. |
MLA | Zhang, Yujia, et al. "Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension." IEEE TRANSACTIONS ON IMAGE PROCESSING 33 (2024): 3256-3270. |
Ingestion Method: OAI Harvesting
Source: Institute of Automation