Chinese Academy of Sciences Institutional Repositories Grid
Inferential and Commonsense Visual Question Generation

Document type: Journal article

Authors: Bi, Chao (1,2); Wang, Shuhui (2,3); Li, Na (4); Huang, Qingming (1,2)
Journal: IEEE TRANSACTIONS ON MULTIMEDIA
Publication date: 2025
Volume: 27; Pages: 7796-7809
Keywords: Visual question generation; visual question answering; multimodal datasets; knowledge and inference
ISSN: 1520-9210
DOI: 10.1109/TMM.2025.3604975
Abstract: The Visual Question Generation (VQG) task aims to produce natural-language questions about images. Existing studies often treat VQG as the reverse of Visual Question Answering (VQA), training data-driven generators on VQA datasets. However, this pipeline struggles to generate high-quality questions that effectively challenge robots and humans, even when it leverages the most advanced large-scale foundation models. Other VQG methods depend heavily on elaborate and costly manual preprocessing. To address these limitations, we propose a novel two-module framework for automatically generating inferential visual questions that also follow commonsense. The "Scene Graph Generation" module constructs specialized scene graphs by progressively expanding connections from high-confidence nodes, and ensures semantic consistency by aligning visual, textual, and salient features. Additionally, we incorporate external knowledge to extend abstract semantic concepts and associated facts, which enriches the content of the generated questions and helps them better follow human commonsense. The "Question Generation" module uses this scene graph as the foundation for searching for and instantiating questions. The generated questions match the program templates and have diverse inferential paths. Experimental results demonstrate that our method is both effective and highly scalable. The generated questions are controllable in semantic richness and difficulty and exhibit clear inferential and commonsense properties. Furthermore, we use our method to automatically create a large-scale dataset, ICVQA, which includes approximately 160,000 images and 800,000 question-answer pairs, thereby facilitating further research in VQA and visual dialogue.
Funding: National Key R&D Program of China [2023YFC2508704]; National Natural Science Foundation of China [62236008]; National Natural Science Foundation of China [62022083]; National Natural Science Foundation of China [U21B2038]; Fundamental Research Funds for the Central Universities; Shandong Provincial Key Research and Development Program [2024CXPT011]; Priority Academic Program Development of QILU Institute of Technology [QIT23NN038]
WOS research areas: Computer Science; Telecommunications
Language: English
WOS accession number: WOS:001598824700008
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Source URL: http://119.78.100.204/handle/2XEOYT63/41612
Collection: Institute of Computing Technology, Chinese Academy of Sciences, Journal Articles (English)
Corresponding author: Wang, Shuhui
Author affiliations:
1. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China
2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
3. Pengcheng Lab, Shenzhen 518000, Peoples R China
4. Qilu Inst Technol, Jinan 250200, Shandong, Peoples R China
Recommended citation:
GB/T 7714: Bi, Chao, Wang, Shuhui, Li, Na, et al. Inferential and Commonsense Visual Question Generation[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27: 7796-7809.
APA: Bi, Chao, Wang, Shuhui, Li, Na, & Huang, Qingming. (2025). Inferential and Commonsense Visual Question Generation. IEEE TRANSACTIONS ON MULTIMEDIA, 27, 7796-7809.
MLA: Bi, Chao, et al. "Inferential and Commonsense Visual Question Generation". IEEE TRANSACTIONS ON MULTIMEDIA 27 (2025): 7796-7809.

Ingest method: OAI harvesting

Source: Institute of Computing Technology


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.