Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension
Document Type | Journal Article |
Authors | Zhang, Yujia1; Li, Qianzhong; Pan, Yi; Zhao, Xiaoguang; Tan, Min |
Journal | IEEE TRANSACTIONS ON IMAGE PROCESSING |
Publication Date | 2024 |
Volume | 33 |
Pages | 3256-3270 |
Keywords | Feature extraction; Visualization; Task analysis; Representation learning; Location awareness; Linguistics; Grounding; Video-based referring expression comprehension; multi-stage learning; image-language cross-generative fusion; consistency loss |
ISSN | 1057-7149 |
DOI | 10.1109/TIP.2024.3394260 |
Corresponding Author | Zhang, Yujia (zhangyujia2014@ia.ac.cn) |
English Abstract | Video-based referring expression comprehension is a challenging task that requires locating the referred object in each frame of a given video. While many existing approaches treat this task as an object-tracking problem, their performance is heavily reliant on the quality of the tracking templates. Furthermore, when there is not enough annotation data to assist in template selection, the tracking may fail. Other approaches are based on object detection, but they often use only one frame adjacent to the key frame for feature learning, which limits their ability to establish the relationship between different frames. In addition, improving the fusion of features from multiple frames and referring expressions to effectively locate the referents remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is based on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning of adjacent time sequences. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning to generate cross-modal features by calculating the similarity between video and expression, and then refining and fusing the generated features. To further enhance the cross-modal feature generation capability of our model, we introduce a consistency loss that constrains the image-language similarity and language-image similarity matrices during feature generation. We evaluate our proposed approach on three public datasets and demonstrate its effectiveness through comprehensive experimental results. |
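The record does not give the exact form of the consistency loss described in the abstract. Below is a minimal sketch of one plausible reading of the idea: the image-to-language similarity matrix, computed with its own learned projections, is constrained to agree with the transpose of the language-to-image one. All function names, projection matrices, and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def similarity(queries: np.ndarray, keys: np.ndarray,
               w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Scaled dot-product similarity with direction-specific projections."""
    q = queries @ w_q                        # (N, d)
    k = keys @ w_k                           # (M, d)
    return (q @ k.T) / np.sqrt(q.shape[-1])  # (N, M)

def consistency_loss(img: np.ndarray, txt: np.ndarray,
                     w_qi, w_kl, w_ql, w_ki) -> float:
    """Hypothetical consistency loss: penalize disagreement between the
    image->language similarity matrix and the transpose of the
    language->image one. Because each direction uses its own projections,
    the two matrices are not trivially equal."""
    s_il = similarity(img, txt, w_qi, w_kl)  # image queries, language keys
    s_li = similarity(txt, img, w_ql, w_ki)  # language queries, image keys
    return float(np.mean((s_il - s_li.T) ** 2))

# Toy usage: 8 frame features and 12 expression-token features, 256-d each.
dim = 256
img = rng.normal(size=(8, dim))
txt = rng.normal(size=(12, dim))
w_qi, w_kl, w_ql, w_ki = (0.05 * rng.normal(size=(dim, dim)) for _ in range(4))
print(consistency_loss(img, txt, w_qi, w_kl, w_ql, w_ki))
```

Minimizing such a term pushes the two cross-modal similarity estimates toward mutual agreement, which is one natural way to read the abstract's constraint on the image-language and language-image similarity matrices.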
WOS Keywords | TRACKING |
Funding Project | National Natural Science Foundation of China |
WOS Research Areas | Computer Science; Engineering |
Language | English |
WOS Accession Number | WOS:001216329500001 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
Funding Organization | National Natural Science Foundation of China |
Source URL | http://ir.ia.ac.cn/handle/173211/58321 |
Collection | Intelligent Robotic Systems Research |
Author Affiliations |
1. Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence Syst, Beijing 100190, Peoples R China
2. Meituan, Beijing 100015, Peoples R China
3. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100190, Peoples R China
4. Chinese Acad Sci, Inst Automat, Lab Cognit & Decis Intelligence Complex Syst, Beijing 100190, Peoples R China |
Recommended Citation (GB/T 7714) | Zhang, Yujia, Li, Qianzhong, Pan, Yi, et al. Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 3256-3270. |
APA | Zhang, Yujia, Li, Qianzhong, Pan, Yi, Zhao, Xiaoguang, & Tan, Min. (2024). Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension. IEEE TRANSACTIONS ON IMAGE PROCESSING, 33, 3256-3270. |
MLA | Zhang, Yujia, et al. "Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension." IEEE TRANSACTIONS ON IMAGE PROCESSING 33 (2024): 3256-3270. |
Ingestion Method: OAI Harvesting
Source: Institute of Automation