Chinese Academy of Sciences Institutional Repositories Grid
A visual-textual mutual guidance fusion network for remote sensing visual question answering

Document type: Journal article

Authors: Liu, Haolin [1]; Chen, Lei [1,2]; Lu, Xinchao [1]; Wang, Hao [3]; Bai, Lu [4]; Wang, Maoli [5]; Ren, Peng [1]
Journal: PATTERN RECOGNITION
Publication date: 2026-08-01
Volume: 176; Pages: 14
Keywords: Remote sensing visual question answering; Transformer; Visual-textual mutual guidance fusion
ISSN: 0031-3203
DOI: 10.1016/j.patcog.2026.113258
Corresponding author: Wang, Hao (wangh_upc@163.com)
Abstract: Existing remote sensing visual question answering (RS VQA) methods are challenged by the presence of small objects in extensive backgrounds, limiting the establishment of explicit cross-modal semantic relationships between visual objects and textual questions. In addition, rich visual information in remote sensing images (RSIs) has not been fully utilized during multi-modal feature fusion. To address these limitations, it is essential to strengthen RS VQA with a more effective mechanism for cross-modal semantic representation and integration. To this end, we propose a novel framework based on a visual-textual mutual guidance fusion network (VMGN). Specifically, a contrast enhancement module is developed to mitigate the influence of the backgrounds and enhance the visual features of small objects. It allows the objects to occupy a prominent position in the visual features. Additionally, the transformer is used to achieve cross-modal interaction between visual and text features. It effectively models the cross-modal semantic relationship between visual and text features. Furthermore, a visual-textual mutual guidance feature fusion module is developed to explore the rich information contained within the visual features of RSIs. Our proposed framework effectively explores the rich information contained within the visual features of RSIs to establish an explicit cross-modal semantic relationship between small objects and their corresponding text. The experimental results show that our proposed framework performs better than state-of-the-art methods on three publicly available datasets. We release the reproducible code and the datasets used at https://github.com/LiuHL929/VMGN for public evaluation and possible extensive studies.
Funding: Shandong Provincial Natural Science Foundation [ZR2024MF061]; National Natural Science Foundation of China [62576371]
WOS research areas: Computer Science; Engineering
Language: English
WOS accession number: WOS:001691721400002
Publisher: ELSEVIER SCI LTD
Source URL: http://ir.qdio.ac.cn/handle/337002/204778
Collection: Institute of Oceanology, Chinese Academy of Sciences
Author affiliations:
1. China Univ Petr East China, Coll Oceanog & Space Informat, Qingdao 266580, Peoples R China
2. Chinese Acad Sci, Inst Oceanol, Qingdao 266000, Peoples R China
3. Laoshan Lab, Qingdao 266237, Peoples R China
4. Beijing Normal Univ, Sch Artificial Intelligence, Beijing 100875, Peoples R China
5. Qufu Normal Univ, Sch Cyber Sci & Engn, Qufu 273165, Peoples R China
Recommended citation:
GB/T 7714: Liu, Haolin, Chen, Lei, Lu, Xinchao, et al. A visual-textual mutual guidance fusion network for remote sensing visual question answering[J]. PATTERN RECOGNITION, 2026, 176: 14.
APA: Liu, Haolin., Chen, Lei., Lu, Xinchao., Wang, Hao., Bai, Lu., ... & Ren, Peng. (2026). A visual-textual mutual guidance fusion network for remote sensing visual question answering. PATTERN RECOGNITION, 176, 14.
MLA: Liu, Haolin, et al. "A visual-textual mutual guidance fusion network for remote sensing visual question answering". PATTERN RECOGNITION 176 (2026): 14.

Ingest method: OAI harvesting

Source: Institute of Oceanology

