Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): Visually Guided Sound Source Separation With Audio-Visual Predictive Coding

Visually Guided Sound Source Separation With Audio-Visual Predictive Coding

Document type: Journal article


Authors	Song, Zengjie2; Zhang, Zhaoxiang1,3
Journal	IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
Publication date	2023-07-12
Pages	15
Keywords	Feature fusion; multimodal learning; predictive coding (PC); self-supervised learning; sound source separation
ISSN	2162-237X
DOI	10.1109/TNNLS.2023.3288022
Corresponding author	Zhang, Zhaoxiang(zhaoxiang.zhang@ia.ac.cn)
Abstract	The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor elaborate visual feature extractors for informative visual guidance and to devise a separate module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, because jointly optimizing and harmonizing the various model components is challenging. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a parameter-efficient and more effective manner. The AVPC network features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding (PC)-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks in the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while significantly reducing the model size. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
WOS keywords	INTEGRATION
Funding projects	Major Project for New Generation of AI[2018AAA0100400] ; National Natural Science Foundation of China[61836014] ; National Natural Science Foundation of China[U21B2042] ; National Natural Science Foundation of China[62072457] ; National Natural Science Foundation of China[62006231] ; National Natural Science Foundation of China[61976174] ; China Postdoctoral Science Foundation[2021M703489]
WOS research areas	Computer Science ; Engineering
Language	English
WOS accession number	WOS:001030674000001
Publisher	IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
Funding organizations	Major Project for New Generation of AI ; National Natural Science Foundation of China ; China Postdoctoral Science Foundation
Source URL	[http://ir.ia.ac.cn/handle/173211/53783]
Collection	State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding author	Zhang, Zhaoxiang
Author affiliations	1.Chinese Acad Sci, Hong Kong Inst Sci & Innovat, Ctr Artificial Intelligence & Robot, Hong Kong, Peoples R China 2.Xi An Jiao Tong Univ, Sch Math & Stat, Xian 710049, Peoples R China 3.Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Recommended citation
GB/T 7714	Song, Zengjie, Zhang, Zhaoxiang. Visually Guided Sound Source Separation With Audio-Visual Predictive Coding[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023: 15.
APA	Song, Zengjie, & Zhang, Zhaoxiang. (2023). Visually Guided Sound Source Separation With Audio-Visual Predictive Coding. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 15.
MLA	Song, Zengjie, et al. "Visually Guided Sound Source Separation With Audio-Visual Predictive Coding." IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2023): 15.

Ingest method: OAI harvesting

Source: Institute of Automation
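
The abstract states that AVPC integrates audio and visual information by iteratively minimizing the prediction error between features. The record gives no architectural detail, so the following is only a minimal sketch of generic predictive-coding refinement, not the authors' network: the linear predictor `W`, the function `refine`, the step size `lr`, and the random toy features are all assumptions made for illustration. A representation is initialized from a visual feature and repeatedly corrected so that its prediction of an audio feature improves.

```python
import random


def matvec(W, r):
    """Return W @ r for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, r)) for row in W]


def matvec_t(W, e):
    """Return W^T @ e for a matrix W stored as a list of rows."""
    return [sum(W[i][j] * e[i] for i in range(len(W))) for j in range(len(W[0]))]


def refine(audio_feat, visual_feat, W, steps=20, lr=0.05):
    """Hypothetical predictive-coding loop (illustrative, not the AVPC code).

    The representation r starts from the visual feature. Each step predicts
    the audio feature as W @ r, measures the prediction error, and moves r
    along the negative gradient of 0.5 * ||error||^2.
    Returns the final r and the squared error measured before each update.
    """
    r = list(visual_feat)
    errors = []
    for _ in range(steps):
        pred = matvec(W, r)
        err = [a - p for a, p in zip(audio_feat, pred)]
        errors.append(sum(e * e for e in err))
        correction = matvec_t(W, err)
        r = [ri + lr * ci for ri, ci in zip(r, correction)]
    return r, errors


# Toy data: random stand-ins for real audio and visual features (assumption).
random.seed(0)
d_audio, d_visual = 6, 4
W = [[random.uniform(-1, 1) for _ in range(d_visual)] for _ in range(d_audio)]
audio_feat = [random.uniform(-1, 1) for _ in range(d_audio)]
visual_feat = [random.uniform(-1, 1) for _ in range(d_visual)]

r, errors = refine(audio_feat, visual_feat, W)
print("squared error, first step:", errors[0])
print("squared error, last step: ", errors[-1])
```

With entries of `W` bounded by 1 in magnitude, the chosen step size is small enough that the squared prediction error cannot increase from one step to the next, which mirrors, in a very reduced form, the "progressively improved" behavior the abstract describes. The actual model and training code are in the repository linked in the abstract.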
