Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos
Document type: Journal article
Author | Guyue Hu (2) |
Journal | Machine Intelligence Research |
Publication date | 2023 |
Volume | 20 Issue: 2 Pages: 249-262 |
Keywords | Prompt learning, video-language pretrained models, instructional videos, procedure understanding, knowledge distilling |
ISSN | 2731-538X |
DOI | 10.1007/s11633-022-1409-1 |
Abstract | Instructional videos are very useful for completing complex daily tasks, which naturally contain abundant clip-narration pairs. Existing works for procedure understanding are keen on pretraining various video-language models with these pairs and then fine-tuning downstream classifiers and localizers in predetermined category space. These video-language models are proficient at representing short-term actions, basic objects, and their combinations, but they are still far from understanding long-term procedures. In addition, the predetermined procedure category faces the problem of combination disaster and is inherently inapt to unseen procedures. Therefore, we propose a novel compositional prompt learning (CPL) framework to understand long-term procedures by prompting short-term video-language models and reformulating several classical procedure understanding tasks into general video-text matching problems. Specifically, the proposed CPL consists of one visual prompt and three compositional textual prompts (including the action prompt, object prompt, and procedure prompt), which could compositionally distill knowledge from short-term video-language models to facilitate long-term procedure understanding. Besides, the task reformulation enables our CPL to perform well in all zero-shot, few-shot, and fully-supervised settings. Extensive experiments on two widely-used datasets for procedure understanding demonstrate the effectiveness of the proposed approach. |
Source URL | [http://ir.ia.ac.cn/handle/173211/55978] |
Collection | Institute of Automation_Academic Journals_International Journal of Automation and Computing |
Author affiliations | 1. The 15th Research Institute of China Electronics Technology Group Corporation, Beijing 100083, China 2. Nanyang Technological University, Singapore 639798, Singapore |
Recommended citation, GB/T 7714 | Guyue Hu, Bin He, Hanwang Zhang. Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos[J]. Machine Intelligence Research, 2023, 20(2): 249-262. |
APA | Guyue Hu, Bin He, & Hanwang Zhang. (2023). Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos. Machine Intelligence Research, 20(2), 249-262. |
MLA | Guyue Hu, et al. "Compositional Prompting Video-language Models to Understand Procedure in Instructional Videos". Machine Intelligence Research 20.2 (2023): 249-262. |
Ingest method: OAI harvesting
Source: Institute of Automation