CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning
Document type: Journal article
Authors | Luo, Huaishao4; Ji, Lei1,2,5; Zhong, Ming3; Chen, Yang3; Lei, Wen3; Duan, Nan2; Li, Tianrui4 |
Journal | NEUROCOMPUTING |
Publication date | 2022-10-07 |
Volume | 508 | Pages | 293-304 |
ISSN | 0925-2312 |
Keywords | Video retrieval; Video captioning; CLIP |
DOI | 10.1016/j.neucom.2022.07.028 |
Abstract | Video clip retrieval and captioning play an essential role in multimodal research and are fundamental problems for multimodal understanding and generation. The CLIP (Contrastive Language-Image Pre-training) model has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the image-text pre-trained CLIP model to video-text tasks in an end-to-end manner. Furthermore, we conduct several empirical studies: 1) Is the image feature alone sufficient for video-text retrieval and captioning? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of CLIP? 3) What is a practical mechanism to model temporal dependency between video frames? 4) How sensitive is the model to its hyper-parameters? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves state-of-the-art (SOTA) results on various video-text datasets, including MSR-VTT, MSVD, LSMDC, and DiDeMo, for multimodal understanding and generation tasks. |
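One of the temporal-modeling mechanisms the abstract alludes to is a parameter-free "mean pooling" of per-frame features into a single video embedding, scored against the text embedding by cosine similarity. The sketch below illustrates that idea with random stand-in arrays; the shapes and features are hypothetical placeholders, not actual CLIP outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors along `axis` to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins for CLIP outputs: 2 videos x 4 frames x 8-dim frame features,
# and 2 text queries x 8-dim text features.
frame_emb = rng.normal(size=(2, 4, 8))
text_emb = l2_normalize(rng.normal(size=(2, 8)))

# Parameter-free mean pooling: average the frame embeddings of each video
# into one video embedding, then score every (text, video) pair by cosine
# similarity (a dot product of unit vectors).
video_emb = l2_normalize(frame_emb.mean(axis=1))
sim = text_emb @ video_emb.T  # (num_texts, num_videos) similarity matrix
```

Retrieval then amounts to ranking the videos in each row of `sim` (or the texts in each column, for the video-to-text direction).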
Funding | National Science Foundation of China [62176221, 61876158, 61806170] |
WOS research area | Computer Science |
Language | English |
Publisher | ELSEVIER |
WOS accession number | WOS:000848021200006 |
Source URL | [http://119.78.100.204/handle/2XEOYT63/19440] |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Papers (English) |
Corresponding authors | Luo, Huaishao; Ji, Lei |
Affiliations | 1. Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China; 2. Microsoft Res Asia, Beijing, Peoples R China; 3. Microsoft STCA, Beijing, Peoples R China; 4. Southwest Jiaotong Univ, Chengdu, Peoples R China; 5. Univ Chinese Acad Sci, Beijing, Peoples R China |
Recommended citation (GB/T 7714) | Luo, Huaishao, Ji, Lei, Zhong, Ming, et al. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning[J]. NEUROCOMPUTING, 2022, 508: 293-304.
APA | Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., & Li, T. (2022). CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508, 293-304.
MLA | Luo, Huaishao, et al. "CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning." Neurocomputing 508 (2022): 293-304.
Ingestion method: OAI harvesting
Source: Institute of Computing Technology