Focus and Align: Learning Tube Tokens for Video-Language Pre-Training
Document type: Journal article
| Authors | Zhu, Yongqing (2,3); Li, Xiangyang (2,3); Zheng, Mao (1); Yang, Jiahao (2,3); Wang, Zihan (2,3); Guo, Xiaoqian (2,3); Chai, Zifeng (1); Yuan, Yuchen (1); Jiang, Shuqiang (2,3) |
| Journal | IEEE TRANSACTIONS ON MULTIMEDIA |
| Publication date | 2023 |
| Volume | 25; Pages: 8036-8050 |
| Keywords | Electron tubes; Semantics; Visualization; Feature extraction; Task analysis; Transformers; Detectors; Local alignment mechanism; semantic centers; tube tokens; video-language pre-training |
| ISSN | 1520-9210 |
| DOI | 10.1109/TMM.2022.3231108 |
| Abstract | Video-language pre-training (VLP) has attracted increasing attention for cross-modality understanding tasks. To enhance visual representations, recent works attempt to adopt transformer-based architectures as video encoders. These works usually focus on the visual representations of the sampled frames. Compared with frame representations, frame patches incorporate more fine-grained spatio-temporal information, which could lead to a better understanding of video contents. However, how to exploit the spatio-temporal information within frame patches for VLP has been less investigated. In this work, we propose a method to learn tube tokens to model the key spatio-temporal information from frame patches. To this end, multiple semantic centers are introduced to focus on the underlying patterns of frame patches. Based on each semantic center, the spatio-temporal information within frame patches is integrated into a unique tube token. Complementary to frame representations, tube tokens provide detailed clues of video contents. Furthermore, to better align the generated tube tokens and the contents of descriptions, a local alignment mechanism is introduced. The experiments based on a variety of downstream tasks demonstrate the effectiveness of the proposed method. |
| Funding | National Natural Science Foundation of China |
| WOS research area | Computer Science ; Telecommunications |
| Language | English |
| WOS accession number | WOS:001125902000019 |
| Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
| Source URL | [http://119.78.100.204/handle/2XEOYT63/38437] |
| Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English) |
| Corresponding author | Jiang, Shuqiang |
| Author affiliations | 1. Tencent, Dept Machine Learning Platform, Beijing 100193, Peoples R China; 2. Univ Chinese Acad Sci, Beijing 100049, Peoples R China; 3. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China |
| Recommended citation (GB/T 7714) | Zhu, Yongqing,Li, Xiangyang,Zheng, Mao,et al. Focus and Align: Learning Tube Tokens for Video-Language Pre-Training[J]. IEEE TRANSACTIONS ON MULTIMEDIA,2023,25:8036-8050. |
| APA | Zhu, Yongqing., Li, Xiangyang., Zheng, Mao., Yang, Jiahao., Wang, Zihan., ... & Jiang, Shuqiang. (2023). Focus and Align: Learning Tube Tokens for Video-Language Pre-Training. IEEE TRANSACTIONS ON MULTIMEDIA, 25, 8036-8050. |
| MLA | Zhu, Yongqing, et al. "Focus and Align: Learning Tube Tokens for Video-Language Pre-Training". IEEE TRANSACTIONS ON MULTIMEDIA 25 (2023): 8036-8050. |
Ingest method: OAI harvesting
Source: Institute of Computing Technology

