Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
Document type: Journal article
| Authors | Liu, Jiawei; Wang, Weining; Chen, Sihan; Zhu, Xinxin; Liu, Jing |
| Journal | IEEE Transactions on Multimedia |
| Publication date | 2023-04-03 |
| Pages | 1-13 |
| Keywords | Text-guided sounding-video generation; Video-audio representation; Contrastive learning; Transformer |
| Document subtype | SCI |
| Abstract | As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on a rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSet-Cap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on the Kinetics and VAS datasets. |
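| Language | English |
| Source URL | [http://ir.ia.ac.cn/handle/173211/51597] |
| Collection | Zidong Taichu Large Model Research Center |
| Corresponding author | Liu, Jing |
| Author affiliation | Institute of Automation, Chinese Academy of Sciences |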
| Recommended citation (GB/T 7714) | Liu, Jiawei, Wang, Weining, Chen, Sihan, et al. Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation[J]. IEEE Transactions on Multimedia, 2023: 1-13. |
| APA | Liu, Jiawei, Wang, Weining, Chen, Sihan, Zhu, Xinxin, & Liu, Jing. (2023). Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation. IEEE Transactions on Multimedia, 1-13. |
| MLA | Liu, Jiawei, et al. "Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation." IEEE Transactions on Multimedia (2023): 1-13. |
Ingestion method: OAI harvesting
Source: Institute of Automation


