CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning
Document type: Journal article
Authors | Luo, Huaishao4; Ji, Lei1,2,5; Zhong, Ming3; Chen, Yang3; Lei, Wen3; Duan, Nan2; Li, Tianrui4 |
Journal | NEUROCOMPUTING |
Publication date | 2022-10-07 |
Volume | 508 | Pages | 293-304 |
ISSN | 0925-2312 |
Keywords | Video retrieval; Video captioning; CLIP |
DOI | 10.1016/j.neucom.2022.07.028 |
Abstract | Video clip retrieval and captioning play an essential role in multimodal research and are fundamental problems for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the image-text pre-trained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies: 1) Is the image feature alone sufficient for video-text retrieval and captioning? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of CLIP? 3) What is a practical mechanism to model temporal dependency between video frames? 4) How sensitive is the model to its hyper-parameters? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art (SOTA) results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for multimodal understanding and generation tasks. |
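One of the temporal-modeling mechanisms the abstract alludes to is a parameter-free "mean pooling" of per-frame features into a single video embedding, scored against the text embedding by cosine similarity. The sketch below illustrates that idea with random stand-in arrays; the shapes and features are hypothetical placeholders, not actual CLIP outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins for CLIP outputs: 2 videos x 4 frames x 8-dim frame features,
# and 2 text queries x 8-dim text features.
frame_emb = rng.normal(size=(2, 4, 8))
text_emb = l2_normalize(rng.normal(size=(2, 8)))

# Parameter-free mean pooling: average the frame embeddings of each video
# into one video embedding, then score every (text, video) pair by cosine
# similarity (a dot product of unit vectors).
video_emb = l2_normalize(frame_emb.mean(axis=1))
sim = text_emb @ video_emb.T  # (num_texts, num_videos) similarity matrix
```

Retrieval then amounts to ranking the videos in each row of `sim` (or the texts in each column, for the video-to-text direction).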
Funding | National Science Foundation of China [62176221, 61876158, 61806170] |
WOS research area | Computer Science |
Language | English |
Publisher | ELSEVIER |
WOS accession number | WOS:000848021200006 |
Source URL | [http://119.78.100.204/handle/2XEOYT63/19440] |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Papers (English) |
Corresponding authors | Luo, Huaishao; Ji, Lei |
Affiliations | 1. Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China; 2. Microsoft Res Asia, Beijing, Peoples R China; 3. Microsoft STCA, Beijing, Peoples R China; 4. Southwest Jiaotong Univ, Chengdu, Peoples R China; 5. Univ Chinese Acad Sci, Beijing, Peoples R China |
Recommended citation (GB/T 7714) | Luo, Huaishao, Ji, Lei, Zhong, Ming, et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning[J]. NEUROCOMPUTING, 2022, 508: 293-304.
APA | Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., & Li, T. (2022). CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508, 293-304.
MLA | Luo, Huaishao, et al. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning." Neurocomputing 508 (2022): 293-304.
Ingestion method: OAI harvesting
Source: Institute of Computing Technology