Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): A visual-textual mutual guidance fusion network for remote sensing visual question answering

A visual-textual mutual guidance fusion network for remote sensing visual question answering

Document type: Journal article


Authors	Liu, Haolin 1; Chen, Lei 1,2; Lu, Xinchao 1; Wang, Hao 3; Bai, Lu 4; Wang, Maoli 5; Ren, Peng 1
Journal	PATTERN RECOGNITION
Publication date	2026-08-01
Volume	176 Pages: 14
Keywords	Remote sensing visual question answering; Transformer; Visual-textual mutual guidance fusion
ISSN	0031-3203
DOI	10.1016/j.patcog.2026.113258
Corresponding author	Wang, Hao(wangh_upc@163.com)
Abstract	Existing remote sensing visual question answering (RS VQA) methods are challenged by the presence of small objects in extensive backgrounds, limiting the establishment of explicit cross-modal semantic relationships between visual objects and textual questions. In addition, rich visual information in remote sensing images (RSIs) has not been fully utilized during multi-modal feature fusion. To address these limitations, it is essential to strengthen RS VQA with a more effective mechanism for cross-modal semantic representation and integration. To this end, we propose a novel framework based on visual-textual mutual guidance fusion network (VMGN). Specifically, a contrast enhancement module is developed to mitigate the influence of the backgrounds and enhance the visual features of small objects. It allows the objects to occupy a prominent position in the visual features. Additionally, the transformer is used to achieve cross-modal interaction between visual and text features. It effectively models the cross-modal semantic relationship between visual and text features. Furthermore, a visual-textual mutual guidance feature fusion module is developed to explore the rich information contained within the visual features of RSIs. Our proposed framework effectively explores the rich information contained within the visual features of RSIs to establish an explicit cross-modal semantic relationship between small objects and their corresponding text. The experimental results show that our proposed framework performs better than state-of-the-art methods on three publicly available datasets. We release the reproducible code and the datasets used at https://github.com/LiuHL929/VMGN for public evaluation and possible extensive studies.
Funding	Shandong Provincial Natural Science Foundation[ZR2024MF061] ; National Natural Science Foundation of China[62576371]
WOS research areas	Computer Science ; Engineering
Language	English
WOS accession number	WOS:001691721400002
Publisher	ELSEVIER SCI LTD
Source URL	[http://ir.qdio.ac.cn/handle/337002/204778]
Collection	Institute of Oceanology, Chinese Academy of Sciences
Corresponding author	Wang, Hao
Author affiliations	1. China Univ Petr East China, Coll Oceanog & Space Informat, Qingdao 266580, Peoples R China; 2. Chinese Acad Sci, Inst Oceanol, Qingdao 266000, Peoples R China; 3. Laoshan Lab, Qingdao 266237, Peoples R China; 4. Beijing Normal Univ, Sch Artificial Intelligence, Beijing 100875, Peoples R China; 5. Qufu Normal Univ, Sch Cyber Sci & Engn, Qufu 273165, Peoples R China
Recommended citation (GB/T 7714)	Liu, Haolin,Chen, Lei,Lu, Xinchao,et al. A visual-textual mutual guidance fusion network for remote sensing visual question answering[J]. PATTERN RECOGNITION,2026,176:14.
APA	Liu, Haolin.,Chen, Lei.,Lu, Xinchao.,Wang, Hao.,Bai, Lu.,...&Ren, Peng.(2026).A visual-textual mutual guidance fusion network for remote sensing visual question answering.PATTERN RECOGNITION,176,14.
MLA	Liu, Haolin,et al."A visual-textual mutual guidance fusion network for remote sensing visual question answering".PATTERN RECOGNITION 176(2026):14.

Ingest method: OAI harvest

Source: Institute of Oceanology
