Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning
Document Type: Journal Article
Authors | Zhang, Xi (2,3); Zhang, Feifei; Xu, Changsheng
Journal | IEEE TRANSACTIONS ON MULTIMEDIA
Publication Date | 2022
Volume | 24
Pages | 2986-2997
Keywords | Cognition; Video recording; Syntactics; Visualization; Task analysis; Semantics; Linguistics; Visual Commonsense Reasoning; explicit reasoning; syntactic structure; interpretability
ISSN | 1520-9210
DOI | 10.1109/TMM.2021.3091882 |
Corresponding Author | Xu, Changsheng (csxu@nlpr.ia.ac.cn)
Abstract | Given a question about an image, Visual Commonsense Reasoning (VCR) must provide not only a correct answer but also a rationale that justifies it. VCR is a challenging task because it requires proper semantic alignment and reasoning between the image and the linguistic expression. Recent approaches show great promise by exploring holistic attention mechanisms or graph-based networks, but most of them perform implicit reasoning and ignore the semantic dependencies within the linguistic expression. In this paper, we propose a novel explicit cross-modal representation learning network for VCR that incorporates syntactic information into visual reasoning and natural language understanding. The proposed method enjoys several merits. First, based on a two-branch neural module network, it performs explicit cross-modal reasoning guided by the high-level syntactic structure of the linguistic expression. Second, the semantic structure of the linguistic expression is incorporated into a syntactic GCN to facilitate language understanding. Third, the explicit cross-modal representation learning network provides a traceable reasoning flow, which offers visible, fine-grained evidence for the answer and rationale. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
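As a rough illustration of the "syntactic GCN" component mentioned in the abstract, the sketch below shows one plausible graph-convolution layer that propagates token features along dependency-parse edges. The class name `SyntacticGCNLayer`, the mean-over-neighbors aggregation, and all tensor shapes are illustrative assumptions under a generic GCN formulation, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """One graph-convolution step over a dependency-parse adjacency matrix (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, tokens, adj):
        # tokens: (n_tokens, dim) word features from a sentence encoder.
        # adj:    (n_tokens, n_tokens) 0/1 dependency edges, self-loops included.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # row degrees, guard against division by zero
        neighborhood = (adj @ tokens) / deg                   # mean over each token's syntactic neighbors
        return torch.relu(self.linear(neighborhood))          # updated, syntax-aware token features

# Toy usage: 4 tokens with 8-dim features, one dependency edge plus self-loops.
x = torch.randn(4, 8)
a = torch.eye(4)
a[0, 1] = a[1, 0] = 1.0   # e.g. an edge between a verb and its subject
layer = SyntacticGCNLayer(8)
print(layer(x, a).shape)  # torch.Size([4, 8])
```

Stacking a few such layers over the parse of the question (or rationale) is one way the linguistic semantic structure could be folded into the language representation before cross-modal reasoning.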
Funding Projects | National Key Research and Development Program of China [2018AAA0100604]; National Natural Science Foundation of China [61720106006]; National Natural Science Foundation of China [62002355]; National Natural Science Foundation of China [61721004]; National Natural Science Foundation of China [61832002]; National Natural Science Foundation of China [61532009]; National Natural Science Foundation of China [61751211]; National Natural Science Foundation of China [62072455]; National Natural Science Foundation of China [U1705262]; National Natural Science Foundation of China [U1836220]; Key Research Program of Frontier Sciences of CAS [QYZDJSSW-JSC039]; National Postdoctoral Program for Innovative Talents [BX20190367]; Beijing Natural Science Foundation [L201001]; Jiangsu Province Key Research and Development Plan [BE2020036]
WOS Research Areas | Computer Science; Telecommunications
Language | English
WOS Record Number | WOS:000809408000024
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Funding Organizations | National Key Research and Development Program of China; National Natural Science Foundation of China; Key Research Program of Frontier Sciences of CAS; National Postdoctoral Program for Innovative Talents; Beijing Natural Science Foundation; Jiangsu Province Key Research and Development Plan
Source URL | [http://ir.ia.ac.cn/handle/173211/49629]
Collection | Institute of Automation, National Laboratory of Pattern Recognition, Multimedia Computing and Graphics Team
Author Affiliations | 1. Peng Cheng Lab, Shenzhen 518066, Peoples R China; 2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; 3. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Recommended Citation (GB/T 7714) | Zhang, Xi, Zhang, Feifei, Xu, Changsheng. Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24: 2986-2997.
APA | Zhang, Xi, Zhang, Feifei, & Xu, Changsheng. (2022). Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning. IEEE TRANSACTIONS ON MULTIMEDIA, 24, 2986-2997.
MLA | Zhang, Xi, et al. "Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning". IEEE TRANSACTIONS ON MULTIMEDIA 24 (2022): 2986-2997.
Deposit Method: OAI Harvest
Source: Institute of Automation