Coarse-to-Fine Recurrently Aligned Transformer with Balance Tokens for Video Moment Retrieval and Highlight Detection
Document Type: Conference Paper
Authors | Pan Yi2 |
Publication Date | 2024-06 |
Conference Date | 2024-06 |
Conference Venue | Yokohama, Japan |
Abstract | Video moment retrieval (MR) and highlight detection (HD) are two user-oriented video understanding tasks aimed at extracting query-dependent or highlighted moments to provide valuable content for users. While many recent works have proposed solutions for the joint task of MR and HD leveraging transformer architecture, we argue that existing approaches have not adequately aligned the video and text modalities using basic transformer encoders, and have overlooked the misalignment between irrelevant video clips and text queries. To address these issues, we introduce COREBA: a Coarse-to-Fine Recurrently Aligned Transformer with Balance Tokens. Firstly, we design a plug-and-play Coarse-to-Fine Cross-modal interaction (CFC) module, replacing the original transformer encoder to align the two modalities in a progressive manner. Secondly, we present a novel Recurrent Alignment Mechanism (RAM) to deeply align the modalities in a recurrent fashion. Thirdly, to mitigate the misalignment problem, we append text queries with learnable Balance Tokens to restrict the text information fused with irrelevant clips. Extensive experiments validate the effectiveness and superiority of our proposed method. |
Proceedings Publisher | IJCNN |
Source URL | [http://ir.ia.ac.cn/handle/173211/57093] |
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems |
Corresponding Author | Chang Hui |
Author Affiliations | 1. First Medical Center, Chinese PLA General Hospital; 2. Institute of Automation, Chinese Academy of Sciences |
Recommended Citation (GB/T 7714) | Pan Yi, Zhang Yujia, Chang Hui, et al. Coarse-to-Fine Recurrently Aligned Transformer with Balance Tokens for Video Moment Retrieval and Highlight Detection[C]. Yokohama, Japan, 2024-06. |
Deposit Method: OAI Harvesting
Source: Institute of Automation
Unless otherwise stated, all content in this system is protected by copyright, and all rights are reserved.