Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering
Document Type: Journal Article
Authors | Song, Yaguang (1,2,3); Yang, Xiaoshan (1,2,3); Wang, Yaowei (1,2,3); Xu, Changsheng (3)
Journal | IEEE Transactions on Multimedia
Publication Date | 2023-05-05
Pages | 1-15
Keywords | Multi-modal Foundation Model; Out-of-Distribution Generalization; Visual Question Answering; Knowledge Distillation
DOI | 10.1109/TMM.2023.3272224 |
Document Subtype | Journal Article
Abstract | With the emergence of large-scale multi-modal foundation models, significant improvements have been made in Visual Question Answering (VQA) in recent years via the “Pre-training and Fine-tuning” paradigm. However, the fine-tuned VQA model, which is more specialized for the downstream training data, may fail to generalize well when there is a distribution shift between the training and test data, which is known as the Out-of-Distribution (OOD) problem. An intuitive way to solve this problem is to transfer common knowledge from the foundation model to the fine-tuned VQA model via knowledge distillation for better generalization. However, the generality of knowledge distilled on the task-specific training data is questionable due to the bias between the training and test data. An ideal solution would be to use the pre-training data to distill the common knowledge shared by the training and OOD test samples, which is, however, impracticable due to the huge size of the pre-training data. Based on the above considerations, in this paper we propose a method, named Pre-training-like Knowledge Distillation (PKD), that imitates the pre-training feature distribution and leverages it to distill the common knowledge, thereby improving the generalization performance of the fine-tuned model for OOD VQA. Specifically, we first leverage the in-domain VQA data as guidance and adopt two cross-modal feature prediction networks, learned under the supervision of an image-text matching loss and a feature divergence loss, to estimate pre-training-like vision and text features. Next, we conduct feature-level distillation by explicitly integrating the downstream VQA input features with the predicted pre-training-like features through a memory mechanism. Meanwhile, we also conduct model-level distillation by constraining the image-text matching output of the downstream VQA model to match the output of the foundation model for the pre-training-like image and text features. Extensive experiments on the VQA-CP v2 and VQA v2 datasets demonstrate the effectiveness of our method.
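The abstract describes two distillation signals: a feature-level one that mixes downstream VQA features with predicted pre-training-like features through a memory mechanism, and a model-level one that constrains the downstream model's image-text matching output against the foundation model's. The following is a minimal, hypothetical PyTorch sketch of these two signals only; the class name, shapes, attention-style memory read, and KL-based matching constraint are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of PKD's two distillation signals as described in the abstract.
# All names, shapes, and design choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PKDLosses(nn.Module):
    """Toy rendering of feature-level and model-level distillation."""

    def __init__(self, dim=768, memory_size=512):
        super().__init__()
        # Hypothetical memory bank holding pre-training-like features.
        self.memory = nn.Parameter(torch.randn(memory_size, dim))

    def feature_level(self, vqa_feat):
        # Feature-level distillation: attend over the memory and mix the
        # retrieved pre-training-like features into the VQA input features.
        attn = F.softmax(vqa_feat @ self.memory.t() / vqa_feat.size(-1) ** 0.5, dim=-1)
        retrieved = attn @ self.memory          # (batch, dim)
        return vqa_feat + retrieved

    @staticmethod
    def model_level(student_itm_logits, teacher_itm_logits, tau=2.0):
        # Model-level distillation: pull the downstream model's image-text
        # matching distribution toward the frozen foundation model's output.
        p_teacher = F.softmax(teacher_itm_logits / tau, dim=-1)
        log_p_student = F.log_softmax(student_itm_logits / tau, dim=-1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2
```

In the paper's setting, the teacher logits would come from the frozen foundation model evaluated on the predicted pre-training-like image and text features, while the student logits come from the fine-tuned VQA model; the sketch leaves both models abstract.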
Language | English
Source URL | http://ir.ia.ac.cn/handle/173211/51954
Collection | Institute of Automation: National Laboratory of Pattern Recognition, Multimedia Computing and Graphics Team; State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding Author | Xu, Changsheng
Author Affiliations | 1. State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences; 2. School of Artificial Intelligence, University of Chinese Academy of Sciences; 3. Peng Cheng Laboratory
Recommended Citation (GB/T 7714) | Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, et al. Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering[J]. IEEE Transactions on Multimedia, 2023: 1-15.
APA | Song, Yaguang, Yang, Xiaoshan, Wang, Yaowei, & Xu, Changsheng. (2023). Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering. IEEE Transactions on Multimedia, 1-15.
MLA | Song, Yaguang, et al. "Recovering Generalization via Pre-training-like Knowledge Distillation for Out-of-Distribution Visual Question Answering". IEEE Transactions on Multimedia (2023): 1-15.
Deposit Method: OAI Harvesting
Source: Institute of Automation