Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning
Document Type: Journal Article
Authors | Zhang, Xi (2,3); Zhang, Feifei; Xu, Changsheng
Journal | IEEE TRANSACTIONS ON MULTIMEDIA
Publication Date | 2022
Volume | 24
Pages | 2986-2997
Keywords | Cognition; Video recording; Syntactics; Visualization; Task analysis; Semantics; Linguistics; Visual Commonsense Reasoning; explicit reasoning; syntactic structure; interpretability
ISSN | 1520-9210
DOI | 10.1109/TMM.2021.3091882 |
Corresponding Author | Xu, Changsheng (csxu@nlpr.ia.ac.cn)
Abstract | Given a question about an image, Visual Commonsense Reasoning (VCR) must provide not only a correct answer but also a rationale that justifies it. VCR is a challenging task because it requires proper semantic alignment and reasoning between the image and the linguistic expression. Recent approaches show great promise by exploring holistic attention mechanisms or graph-based networks, but most of them perform implicit reasoning and ignore the semantic dependencies within the linguistic expression. In this paper, we propose a novel explicit cross-modal representation learning network for VCR that incorporates syntactic information into visual reasoning and natural language understanding. The proposed method enjoys several merits. First, based on a two-branch neural module network, it performs explicit cross-modal reasoning guided by the high-level syntactic structure of the linguistic expression. Second, the semantic structure of the linguistic expression is incorporated into a syntactic GCN to facilitate language understanding. Third, the explicit cross-modal representation learning network provides a traceable reasoning flow, which offers visible, fine-grained evidence for the answer and rationale. Quantitative and qualitative evaluations on the public VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.
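As a rough illustration of the "syntactic GCN" component mentioned in the abstract, the sketch below shows one plausible graph-convolution layer that propagates token features along dependency-parse edges. The class name `SyntacticGCNLayer`, the mean-over-neighbors aggregation, and all tensor shapes are illustrative assumptions under a generic GCN formulation, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """One graph-convolution step over a dependency-parse adjacency matrix (illustrative sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, tokens, adj):
        # tokens: (n_tokens, dim) word features from a sentence encoder.
        # adj:    (n_tokens, n_tokens) 0/1 dependency edges, self-loops included.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # row degrees, guard against division by zero
        neighborhood = (adj @ tokens) / deg                   # mean over each token's syntactic neighbors
        return torch.relu(self.linear(neighborhood))          # updated, syntax-aware token features

# Toy usage: 4 tokens with 8-dim features, one dependency edge plus self-loops.
x = torch.randn(4, 8)
a = torch.eye(4)
a[0, 1] = a[1, 0] = 1.0   # e.g. an edge between a verb and its subject
layer = SyntacticGCNLayer(8)
print(layer(x, a).shape)  # torch.Size([4, 8])
```

Stacking a few such layers over the parse of the question (or rationale) is one way the linguistic semantic structure could be folded into the language representation before cross-modal reasoning.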
Funding Projects | National Key Research and Development Program of China [2018AAA0100604]; National Natural Science Foundation of China [61720106006]; National Natural Science Foundation of China [62002355]; National Natural Science Foundation of China [61721004]; National Natural Science Foundation of China [61832002]; National Natural Science Foundation of China [61532009]; National Natural Science Foundation of China [61751211]; National Natural Science Foundation of China [62072455]; National Natural Science Foundation of China [U1705262]; National Natural Science Foundation of China [U1836220]; Key Research Program of Frontier Sciences of CAS [QYZDJSSW-JSC039]; National Postdoctoral Program for Innovative Talents [BX20190367]; Beijing Natural Science Foundation [L201001]; Jiangsu Province Key Research and Development Plan [BE2020036]
WOS Research Areas | Computer Science; Telecommunications
Language | English
WOS Record Number | WOS:000809408000024
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Funding Organizations | National Key Research and Development Program of China; National Natural Science Foundation of China; Key Research Program of Frontier Sciences of CAS; National Postdoctoral Program for Innovative Talents; Beijing Natural Science Foundation; Jiangsu Province Key Research and Development Plan
Source URL | [http://ir.ia.ac.cn/handle/173211/49629]
Collection | Institute of Automation, National Laboratory of Pattern Recognition, Multimedia Computing and Graphics Team
Author Affiliations | 1. Peng Cheng Lab, Shenzhen 518066, Peoples R China; 2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China; 3. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Recommended Citation (GB/T 7714) | Zhang, Xi, Zhang, Feifei, Xu, Changsheng. Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24: 2986-2997.
APA | Zhang, Xi, Zhang, Feifei, & Xu, Changsheng. (2022). Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning. IEEE TRANSACTIONS ON MULTIMEDIA, 24, 2986-2997.
MLA | Zhang, Xi, et al. "Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning". IEEE TRANSACTIONS ON MULTIMEDIA 24 (2022): 2986-2997.
Deposit Method: OAI Harvest
Source: Institute of Automation