Chinese Academy of Sciences Institutional Repositories Grid
Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

Document type: Journal article

Authors: Liu, Jiawei; Wang, Weining; Chen, Sihan; Zhu, Xinxin; Liu, Jing
Journal: IEEE Transactions on Multimedia
Publication date: 2023-04-03
Pages: 1-13
Keywords: Text-guided sounding-video generation; Video-audio representation; Contrastive learning; Transformer
Document subtype: SCI
Abstract

As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on a rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSet-Cap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on the Kinetics and VAS datasets.
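
The abstract states that visual frames and audio mel-spectrograms are transformed into discrete tokens. As a rough illustration only, the sketch below shows the generic nearest-codebook lookup used in vector-quantized models; it is not the SVG-VQGAN implementation. The encoder is omitted, and the function names (`quantize`, `dequantize`), the toy codebook, and the feature values are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    if not codebook:
        raise ValueError("codebook must not be empty")
    tokens = []
    for feature in features:
        distances = [squared_distance(feature, code) for code in codebook]
        # Ties resolve to the lowest index because list.index returns the first match.
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a 3-entry codebook and three 2-D "encoder outputs".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frame_features = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

tokens = quantize(frame_features, codebook)
print(tokens)                        # discrete token ids
print(dequantize(tokens, codebook))  # quantized vectors a decoder would consume
```

In a setup like the one the abstract describes, token sequences of this kind (one per modality) would be what an auto-regressive Transformer decoder models; how SVG orders or combines them is not specified here.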

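Language: English
Source URL: http://ir.ia.ac.cn/handle/173211/51597
Collection: Zidong Taichu Large Model Research Center
Corresponding author: Liu, Jing
Author affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation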
GB/T 7714
Liu, Jiawei, Wang, Weining, Chen, Sihan, et al. Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation[J]. IEEE Transactions on Multimedia, 2023: 1-13.
APA Liu, Jiawei, Wang, Weining, Chen, Sihan, Zhu, Xinxin, & Liu, Jing. (2023). Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation. IEEE Transactions on Multimedia, 1-13.
MLA Liu, Jiawei, et al. "Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation." IEEE Transactions on Multimedia (2023): 1-13.

Ingest method: OAI harvesting

Source: Institute of Automation

