Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation
Document type: Journal article
Authors | Liu, Jiawei; Wang, Weining; Chen, Sihan; Zhu, Xinxin; Liu, Jing |
Journal | IEEE Transactions on Multimedia |
Publication date | 2023-04-03 |
Pages | 1 - 13 |
Keywords | Text-guided sounding-video generation; Video-audio representation; Contrastive learning; Transformer |
Document subtype | SCI |
Abstract | As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on a rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSet-Cap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on Kinetics and VAS datasets. |
Language | English |
Source URL | [http://ir.ia.ac.cn/handle/173211/51597] |
Collection | Zidong Taichu Large Model Research Center |
Corresponding author | Liu, Jing |
Author affiliation | Institute of Automation, Chinese Academy of Sciences |
Recommended citation, GB/T 7714 | Liu, Jiawei, Wang, Weining, Chen, Sihan, et al. Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation[J]. IEEE Transactions on Multimedia, 2023: 1 - 13. |
APA | Liu, Jiawei, Wang, Weining, Chen, Sihan, Zhu, Xinxin, & Liu, Jing. (2023). Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation. IEEE Transactions on Multimedia, 1 - 13. |
MLA | Liu, Jiawei, et al. "Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation". IEEE Transactions on Multimedia (2023): 1 - 13. |
Ingest method: OAI harvesting
Source: Institute of Automation