Chinese Academy of Sciences Institutional Repositories Grid: Inferential and Commonsense Visual Question Generation

Inferential and Commonsense Visual Question Generation

Document type: Journal article


Authors	Bi, Chao 1,2; Wang, Shuhui 2,3; Li, Na 4; Huang, Qingming 1,2
Journal	IEEE TRANSACTIONS ON MULTIMEDIA
Publication date	2025
Volume	27	Pages: 7796-7809
Keywords	Visual question generation; visual question answering; multimodal datasets; knowledge and inference
ISSN	1520-9210
DOI	10.1109/TMM.2025.3604975
Abstract	The Visual Question Generation (VQG) task generally aims to produce natural-language questions based on images. Existing studies often treat VQG as reverse Visual Question Answering (VQA), training data-driven generators on VQA datasets. However, this pipeline struggles to generate high-quality questions that effectively challenge robots and humans, even when it leverages the most advanced large-scale foundation models. Other VQG methods depend heavily on elaborate and costly manual preprocessing. To address these limitations, we propose a novel two-module framework for automatically generating inferential visual questions that also follow commonsense. The "Scene Graph Generation" module constructs specialized scene graphs by progressively expanding connections from high-confidence nodes. This module ensures semantic consistency by aligning visual, textual, and salient features. Additionally, we incorporate external knowledge to extend abstract semantic concepts and associated facts, enriching the content of the generated questions and helping them better follow human commonsense. The "Question Generation" module uses this scene graph as a foundation to search for and instantiate questions. The generated questions match the program templates and have diverse inferential paths. Experimental results demonstrate that our method is both effective and highly scalable. The generated questions are controllable in terms of semantic richness and difficulty, and exhibit clear inferential and commonsense properties. Furthermore, we use our method to automatically create a large-scale dataset, ICVQA, which includes approximately 160,000 images and 800,000 question-answer pairs, thereby facilitating further research in VQA and visual dialogue.
Funding	National Key R&D Program of China[2023YFC2508704] ; National Natural Science Foundation of China[62236008] ; National Natural Science Foundation of China[62022083] ; National Natural Science Foundation of China[U21B2038] ; Fundamental Research Funds for the Central Universities ; Shandong Provincial Key Research and Development Program[2024CXPT011] ; Priority Academic Program Development of QILU Institute of Technology[QIT23NN038]
WOS research areas	Computer Science ; Telecommunications
Language	English
WOS accession number	WOS:001598824700008
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Source URL	[http://119.78.100.204/handle/2XEOYT63/41612]
Collection	Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English)
Corresponding author	Wang, Shuhui
Author affiliations	1. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China; 3. Pengcheng Lab, Shenzhen 518000, Peoples R China; 4. Qilu Inst Technol, Jinan 250200, Shandong, Peoples R China
Recommended citation
GB/T 7714	Bi, Chao,Wang, Shuhui,Li, Na,et al. Inferential and Commonsense Visual Question Generation[J]. IEEE TRANSACTIONS ON MULTIMEDIA,2025,27:7796-7809.
APA	Bi, Chao,Wang, Shuhui,Li, Na,&Huang, Qingming.(2025).Inferential and Commonsense Visual Question Generation.IEEE TRANSACTIONS ON MULTIMEDIA,27,7796-7809.
MLA	Bi, Chao,et al."Inferential and Commonsense Visual Question Generation".IEEE TRANSACTIONS ON MULTIMEDIA 27(2025):7796-7809.

Ingest method: OAI harvesting

Source: Institute of Computing Technology
