Chinese Academy of Sciences Institutional Repositories Grid

Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

Document type: Journal article


Authors	Liu, Jiawei; Wang, Weining; Chen, Sihan; Zhu, Xinxin; Liu, Jing
Journal	IEEE Transactions on Multimedia
Publication date	2023-04-03
Pages	1 - 13
Keywords	Text-guided sounding-video generation; Video-audio representation; Contrastive learning; Transformer
Document subtype	SCI
Abstract	As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on a rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present the SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at token level for auto-regressive sounding video generation. AudioSet-Cap, a human annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on Kinetics and VAS datasets.
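Language	English
Source URL	[http://ir.ia.ac.cn/handle/173211/51597]
Collection	Zidong Taichu Large Model Research Center (紫东太初大模型研究中心)
Corresponding author	Liu, Jing
Affiliation	Institute of Automation, Chinese Academy of Sciences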
Recommended citation (GB/T 7714)	Liu, Jiawei, Wang, Weining, Chen, Sihan, et al. Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation[J]. IEEE Transactions on Multimedia, 2023: 1 - 13.
APA	Liu, Jiawei, Wang, Weining, Chen, Sihan, Zhu, Xinxin, & Liu, Jing. (2023). Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation. IEEE Transactions on Multimedia, 1 - 13.
MLA	Liu, Jiawei, et al. "Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation." IEEE Transactions on Multimedia (2023): 1 - 13.

Ingestion method: OAI harvesting

Source: Institute of Automation

