Inferential and Commonsense Visual Question Generation
Document type: Journal article
| Authors | Bi, Chao (1,2); Wang, Shuhui (2,3); Li, Na (4); Huang, Qingming (1,2) |
| Journal | IEEE TRANSACTIONS ON MULTIMEDIA |
| Publication date | 2025 |
| Volume | 27; pages: 7796-7809 |
| Keywords | Visual question generation; visual question answering; multimodal datasets; knowledge and inference |
| ISSN | 1520-9210 |
| DOI | 10.1109/TMM.2025.3604975 |
| Abstract | The Visual Question Generation (VQG) task generally aims to produce natural-language questions based on images. Existing studies often treat VQG as reverse Visual Question Answering (VQA), training data-driven generators on VQA datasets. However, this pipeline struggles to generate high-quality questions that effectively challenge robots and humans, even when leveraging the most advanced large-scale foundation models. Other VQG methods depend heavily on elaborate and costly manual preprocessing. To address these limitations, we propose a novel two-module framework for automatically generating inferential visual questions that also follow commonsense. The "Scene Graph Generation" module constructs specialized scene graphs by progressively expanding connections from high-confidence nodes. This module ensures semantic consistency by aligning visual, textual, and salient features. Additionally, we incorporate external knowledge to extend abstract semantic concepts and associated facts, enriching the content of the generated questions and helping them better follow human commonsense. The "Question Generation" module uses this scene graph as the foundation for searching for and instantiating questions. The generated questions match the program templates and have diverse inferential paths. Experimental results demonstrate that our method is both effective and highly scalable. The generated questions are controllable in terms of semantic richness and difficulty, exhibiting clear inferential and commonsense properties. Furthermore, we use our method to automatically create a large-scale dataset, ICVQA, which includes approximately 160,000 images and 800,000 question-answer pairs, thereby facilitating further research in VQA and visual dialogue. |
| Funding | National Key R&D Program of China [2023YFC2508704]; National Natural Science Foundation of China [62236008]; National Natural Science Foundation of China [62022083]; National Natural Science Foundation of China [U21B2038]; Fundamental Research Funds for the Central Universities; Shandong Provincial Key Research and Development Program [2024CXPT011]; Priority Academic Program Development of QILU Institute of Technology [QIT23NN038] |
| WOS research areas | Computer Science; Telecommunications |
| Language | English |
| WOS accession number | WOS:001598824700008 |
| Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
| Source URL | [http://119.78.100.204/handle/2XEOYT63/41612] |
| Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English) |
| Corresponding author | Wang, Shuhui |
| Author affiliations | 1. Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 101408, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China; 3. Pengcheng Lab, Shenzhen 518000, Peoples R China; 4. Qilu Inst Technol, Jinan 250200, Shandong, Peoples R China |
| Recommended citation (GB/T 7714) | Bi, Chao, Wang, Shuhui, Li, Na, et al. Inferential and Commonsense Visual Question Generation[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27: 7796-7809. |
| APA | Bi, Chao, Wang, Shuhui, Li, Na, & Huang, Qingming. (2025). Inferential and Commonsense Visual Question Generation. IEEE TRANSACTIONS ON MULTIMEDIA, 27, 7796-7809. |
| MLA | Bi, Chao, et al. "Inferential and Commonsense Visual Question Generation". IEEE TRANSACTIONS ON MULTIMEDIA 27 (2025): 7796-7809. |
Ingest method: OAI harvesting
Source: Institute of Computing Technology

