Chinese Academy of Sciences Institutional Repositories Grid (中国科学院机构知识库网格)
Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos

Document type: Journal article

Authors: Guyue Hu (2); Bin He (1); Hanwang Zhang (2)
Journal: Machine Intelligence Research
Publication date: 2023
Volume: 20; Issue: 2; Pages: 249-262
Keywords: Prompt learning; video-language pretrained models; instructional videos; procedure understanding; knowledge distilling
ISSN: 2731-538X
DOI: 10.1007/s11633-022-1409-1
Abstract: Instructional videos, which naturally contain abundant clip-narration pairs, are very useful for completing complex daily tasks. Existing works on procedure understanding focus on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers in a predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, a predetermined procedure category space suffers from combinatorial explosion and is inherently ill-suited to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (the action prompt, object prompt, and procedure prompt), which compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. In addition, the task reformulation enables CPL to perform well in zero-shot, few-shot, and fully supervised settings. Extensive experiments on two widely used datasets for procedure understanding demonstrate the effectiveness of the proposed approach.
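Illustrative note: the abstract describes recasting procedure understanding as video-text matching. The sketch below shows only that general idea, scoring one video embedding against several text-prompt embeddings by cosine similarity. It is not the authors' CPL implementation: the prompt template in `compose_prompt`, the function names, and the three-dimensional toy embeddings are all assumptions made for illustration, and no real video-language model is involved.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def compose_prompt(action, obj):
    """Hypothetical text template combining an action word and an object word."""
    return f"a video of someone performing the step: {action} {obj}"


def match_video_to_text(video_embedding, text_embeddings):
    """Return the best-matching prompt and the score of every candidate prompt."""
    scores = {
        prompt: cosine_similarity(video_embedding, embedding)
        for prompt, embedding in text_embeddings.items()
    }
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Toy embeddings, invented for illustration. In practice they would come from
# the text and video encoders of a pretrained video-language model.
toy_text_embeddings = {
    compose_prompt("crack", "egg"): [0.9, 0.1, 0.0],
    compose_prompt("whisk", "batter"): [0.1, 0.9, 0.1],
    compose_prompt("flip", "pancake"): [0.0, 0.2, 0.9],
}
toy_video_embedding = [0.05, 0.3, 0.8]

best_label, all_scores = match_video_to_text(toy_video_embedding, toy_text_embeddings)
print(best_label)
```

Because the candidate set is just a dictionary of text prompts, a new or unseen step can be added by composing another prompt rather than retraining a fixed-category classifier, which is the property the abstract attributes to the matching formulation.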
Source URL: http://ir.ia.ac.cn/handle/173211/55978
Collection: Institute of Automation / Academic Journals / International Journal of Automation and Computing
Author affiliations:
1. The 15th Research Institute of China Electronics Technology Group Corporation, Beijing 100083, China
2. Nanyang Technological University, Singapore 639798, Singapore
Recommended citation
GB/T 7714: Guyue Hu, Bin He, Hanwang Zhang. Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos[J]. Machine Intelligence Research, 2023, 20(2): 249-262.
APA: Guyue Hu, Bin He, & Hanwang Zhang. (2023). Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos. Machine Intelligence Research, 20(2), 249-262.
MLA: Guyue Hu, et al. "Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos." Machine Intelligence Research 20.2 (2023): 249-262.

Ingest method: OAI harvesting

Source: Institute of Automation
