Chinese Academy of Sciences Institutional Repositories Grid
CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Document type: Journal article

Authors: Luo, Huaishao [4]; Ji, Lei [1,2,5]; Zhong, Ming [3]; Chen, Yang [3]; Lei, Wen [3]; Duan, Nan [2]; Li, Tianrui [4]
Journal: NEUROCOMPUTING
Publication date: 2022-10-07
Volume: 508; Pages: 293-304
ISSN: 0925-2312
Keywords: Video retrieval; Video captioning; CLIP
DOI: 10.1016/j.neucom.2022.07.028
Abstract: Video clip retrieval and captioning tasks play an essential role in multimodal research and are the fundamental research problem for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of visual concepts learning from web collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the image-text pretrained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies including 1) Whether image feature is enough for video-text retrieval and captioning? 2) How a post-pretraining on a large-scale video-text dataset based on the CLIP affect the performance? 3) What is the practical mechanism to model temporal dependency between video frames? And 4) The Hyper-parameters sensitivity of the model. Extensive experimental results present that the CLIP4Clip model transferred from the CLIP can achieve SOTA results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo for multimodal understanding and generation tasks. (c) 2022 Elsevier B.V. All rights reserved.
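Funding: National Science Foundation of China [62176221]; National Science Foundation of China [61876158]; National Science Foundation of China [61806170]
WOS research area: Computer Science
Language: English
Publisher: ELSEVIER
WOS accession number: WOS:000848021200006
Source URL: [http://119.78.100.204/handle/2XEOYT63/19440]
Collection: Institute of Computing Technology, Chinese Academy of Sciences, Journal Articles (English)
Corresponding authors: Luo, Huaishao; Ji, Lei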
Author affiliations:
1.Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
2.Microsoft Res Asia, Beijing, Peoples R China
3.Microsoft STCA, Beijing, Peoples R China
4.Southwest Jiaotong Univ, Chengdu, Peoples R China
5.Univ Chinese Acad Sci, Beijing, Peoples R China
Recommended citation
GB/T 7714
Luo, Huaishao,Ji, Lei,Zhong, Ming,et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning[J]. NEUROCOMPUTING,2022,508:293-304.
APA Luo, Huaishao.,Ji, Lei.,Zhong, Ming.,Chen, Yang.,Lei, Wen.,...&Li, Tianrui.(2022).CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning.NEUROCOMPUTING,508,293-304.
MLA Luo, Huaishao,et al."CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning".NEUROCOMPUTING 508(2022):293-304.

Ingest method: OAI harvesting

Source: Institute of Computing Technology


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.