Chinese Academy of Sciences Institutional Repositories Grid
Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering

Document Type: Journal Article

Authors: Song, Yaguang 1,2,3; Yang, Xiaoshan 1,2,3; Wang, Yaowei 1,2,3; Xu, Changsheng 3
Journal: IEEE Transactions on Multimedia
Publication Date: 2023-05-05
Pages: 1-15
Keywords: Multi-modal Foundation Model; Out-of-Distribution Generalization; Visual Question Answering; Knowledge Distillation
DOI: 10.1109/TMM.2023.3272224
Document Subtype: Journal Article
Abstract

With the emergence of large-scale multi-modal foundation models, significant improvements have been made towards Visual Question Answering (VQA) in recent years via the “Pre-training and Fine-tuning” paradigm. However, the fine-tuned VQA model, which is more specialized for the downstream training data, may fail to generalize well when there is a distribution shift between the training and test data, which is defined as the Out-of-Distribution (OOD) problem. An intuitive way to solve this problem is to transfer the common knowledge from the foundation model to the fine-tuned VQA model via knowledge distillation for better generalization. However, the generality of distilled knowledge based on the task-specific training data is questionable due to the bias between the training and test data. An ideal way is to adopt the pre-training data to distill the common knowledge shared by the training and OOD test samples, which however is impracticable due to the huge size of pre-training data. Based on the above considerations, in this paper, we propose a method, named Pre-training-like Knowledge Distillation (PKD), to imitate the pre-training feature distribution and leverage it to distill the common knowledge, which can improve the generalization performance of the fine-tuned model for OOD VQA. Specifically, we first leverage the in-domain VQA data as guidance and adopt two cross-modal feature prediction networks, which are learned under the supervision of image-text matching loss and feature divergence loss, to estimate pre-training-like vision and text features. Next, we conduct feature-level distillation by explicitly integrating the downstream VQA input features with the predicted pre-training-like features through a memory mechanism. In the meantime, we also conduct model-level distillation by constraining the image-text matching output of the downstream VQA model and the output of the foundation model for the pre-training-like image and text features. Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our method.
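The abstract describes two distillation signals: a feature-level term that integrates the downstream VQA input features with predicted pre-training-like features through a memory mechanism, and a model-level term that aligns the fine-tuned model's image-text matching output with the foundation model's output. The following is a minimal, illustrative PyTorch sketch of those two terms only; it is not the authors' released code, and all names (FeaturePredictor, memory_read, feature_level_loss, model_level_loss), dimensions, and loss choices are assumptions introduced for illustration.

# Illustrative sketch of the two distillation signals described in the abstract.
# Not the authors' implementation; module names, dimensions, and loss choices
# (MSE for features, temperature-scaled KL for matching outputs) are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePredictor(nn.Module):
    # Predicts pre-training-like features from downstream VQA features.
    # Assumed to be a small MLP; the paper trains such predictors under
    # image-text matching and feature divergence supervision.
    def __init__(self, dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)

def memory_read(query, memory):
    # Attention-style read: integrates downstream input features (query)
    # with predicted pre-training-like features stored in a memory bank.
    attn = F.softmax(query @ memory.t() / memory.size(-1) ** 0.5, dim=-1)
    return attn @ memory

def feature_level_loss(vqa_feat, pretrain_like_feat):
    # Feature-level distillation: pull downstream features toward the
    # predicted pre-training-like feature distribution.
    return F.mse_loss(vqa_feat, pretrain_like_feat)

def model_level_loss(student_itm_logits, teacher_itm_logits, T=2.0):
    # Model-level distillation: match the fine-tuned model's image-text
    # matching output to the foundation model's output (KL with temperature).
    p_teacher = F.softmax(teacher_itm_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_itm_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

In a full training loop these two terms would presumably be added, with weighting coefficients, to the standard VQA answer loss; the weights and schedules are not specified in the abstract.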

Language: English
Source URL: http://ir.ia.ac.cn/handle/173211/51954
Collection: Institute of Automation_National Laboratory of Pattern Recognition_Multimedia Computing and Graphics Team
State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding Author: Xu, Changsheng
Author Affiliations:
1. State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences
2. School of Artificial Intelligence, University of Chinese Academy of Sciences
3. Peng Cheng Laboratory
Recommended Citation Formats
GB/T 7714
Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, et al. Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering[J]. IEEE Transactions on Multimedia, 2023: 1-15.
APA: Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, & Xu, Changsheng. (2023). Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering. IEEE Transactions on Multimedia, 1-15.
MLA: Song, Yaguang, et al. "Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering". IEEE Transactions on Multimedia (2023): 1-15.

Deposit Method: OAI Harvesting

Source: Institute of Automation


Unless otherwise noted, all content in this system is protected by copyright, with all rights reserved.