Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering

Document type: Journal article


Authors	Zheng, Wenbo 4,5; Yan, Lan 2,3; Wang, Fei-Yue 1
Journal	IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS
Publication date	2023-10-17
Pages	12
Keywords	Graph attention ; graph reasoning ; multimodal graph ; self-attention ; text-based visual question answering
ISSN	2168-2216
DOI	10.1109/TSMC.2023.3319964
Corresponding author	Zheng, Wenbo (zwb2022@whut.edu.cn)
Abstract	While texts related to images convey fundamental messages for scene understanding and reasoning, text-based visual question answering tasks concentrate on visual questions that require reading texts from images. However, most current methods add multimodal features that are independently extracted from a given image into a reasoning model without considering their inter- and intra-relationships according to three modalities (i.e., scene texts, questions, and images). To this end, we propose a novel text-based visual question answering model, multimodal graph reasoning. Our model first extracts intramodality relationships by taking the representations from identical modalities as semantic graphs. Then, we present graph multihead self-attention, which boosts each graph representation through graph-by-graph aggregation to capture the intermodality relationship. It is a case of "so many heads, so many wits" in the sense that as more semantic graphs are involved in this process, each graph representation becomes more effective. Finally, these representations are reprojected, and we perform answer prediction with their outputs. The experimental results demonstrate that our approach realizes substantially better performance compared with other state-of-the-art models.
WOS keywords	ATTENTIONS ; LANGUAGE ; VISION
Funding projects	Natural Science Foundation of China[62303361] ; Natural Science Foundation of China[62302161] ; Natural Science Foundation of China[U1811463] ; Hainan Provincial Natural Science Foundation of China[623QN266] ; Fundamental Research Funds for the Central Universities[233110002] ; China National Postdoctoral Program for Innovative Talents[BX20230114] ; National Key Research and Development Program of China[2018AAA0101502]
WOS research areas	Automation & Control Systems ; Computer Science
Language	English
WOS accession number	WOS:001090709300001
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Funding agencies	Natural Science Foundation of China ; Hainan Provincial Natural Science Foundation of China ; Fundamental Research Funds for the Central Universities ; China National Postdoctoral Program for Innovative Talents ; National Key Research and Development Program of China
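Source URL	[http://ir.ia.ac.cn/handle/173211/54316]
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems ; Institute of Automation_State Key Laboratory of Management and Control for Complex Systems_Advanced Control and Automation Team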
Author affiliations	1. Chinese Acad Sci, Inst Automat, State Key Lab Management & Control Complex Syst, Beijing 100190, Peoples R China ; 2. Natl Supercomp Ctr, Changsha 410082, Hunan, Peoples R China ; 3. Hunan Univ, Coll Comp Sci & Engn, Changsha 410082, Hunan, Peoples R China ; 4. Wuhan Univ Technol, Sanya Sci & Educ Innovat Pk, Sanya 572000, Peoples R China ; 5. Wuhan Univ Technol, Sch Comp Sci & Artificial Intelligence, Wuhan 430070, Peoples R China
Recommended citation: GB/T 7714	Zheng, Wenbo,Yan, Lan,Wang, Fei-Yue. So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering[J]. IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS,2023:12.
APA	Zheng, Wenbo,Yan, Lan,&Wang, Fei-Yue.(2023).So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering.IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS,12.
MLA	Zheng, Wenbo,et al."So Many Heads, So Many Wits: Multimodal Graph Reasoning for Text-Based Visual Question Answering".IEEE TRANSACTIONS ON SYSTEMS MAN CYBERNETICS-SYSTEMS (2023):12.
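
Illustrative note: the abstract describes graph multihead self-attention as graph-by-graph aggregation across the three modalities (scene texts, questions, and images). The sketch below is only a minimal illustration of that general idea and is not the authors' implementation. It assumes each semantic graph has already been pooled into a single feature vector, uses identity projections in place of learned query/key/value weights, and splits the feature dimension evenly into heads. The function name `graph_multihead_self_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_multihead_self_attention(graph_vectors, num_heads):
    """Mix graph-level vectors with scaled dot-product attention, head by head.

    graph_vectors: list of equal-length float lists, one per semantic graph
                   (assumed here: question, scene text, image).
    num_heads:     number of heads; must divide the feature dimension.
    Returns (outputs, weights), where weights[h][i][j] is how much graph i
    attends to graph j in head h.
    Assumption: identity projections stand in for learned Q/K/V matrices.
    """
    dim = len(graph_vectors[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    outputs = [[] for _ in graph_vectors]
    weights = []
    for h in range(num_heads):
        start, stop = h * head_dim, (h + 1) * head_dim
        slices = [vec[start:stop] for vec in graph_vectors]
        head_weights = []
        for i, query in enumerate(slices):
            # Score this graph against every graph, itself included.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
                for key in slices
            ]
            attn = softmax(scores)
            head_weights.append(attn)
            # Aggregate the other graphs' features, weighted by attention.
            mixed = [
                sum(w * value[d] for w, value in zip(attn, slices))
                for d in range(head_dim)
            ]
            outputs[i].extend(mixed)
        weights.append(head_weights)
    return outputs, weights


# Toy pooled features for three hypothetical semantic graphs.
question = [1.0, 0.0, 0.5, 0.2]
scene_text = [0.0, 1.0, 0.3, 0.4]
image = [0.5, 0.5, 0.1, 0.9]
fused, attention = graph_multihead_self_attention(
    [question, scene_text, image], num_heads=2
)
```

In this sketch every output vector is a per-head convex combination of the input graph vectors, so each graph representation is updated with information from the others. A trained model would use learned projections and node-level graph features instead.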

Ingest method: OAI harvesting

Source: Institute of Automation
