Visually Guided Sound Source Separation With Audio-Visual Predictive Coding
Document type: Journal article
Authors | Song, Zengjie2; Zhang, Zhaoxiang1,3 |
Journal | IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS |
Publication date | 2023-07-12 |
Pages | 15 |
Keywords | Feature fusion; multimodal learning; predictive coding (PC); self-supervised learning; sound source separation |
ISSN | 2162-237X |
DOI | 10.1109/TNNLS.2023.3288022 |
Corresponding author | Zhang, Zhaoxiang (zhaoxiang.zhang@ia.ac.cn) |
Abstract | The framework of visually guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor involved visual feature extractor for informative visual guidance and separately devise module for feature fusion, while utilizing U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and, meanwhile, may obtain suboptimal performance as jointly optimizing and harmonizing various model components is challengeable. By contrast, this article presents a novel approach, dubbed audio-visual predictive coding (AVPC), to tackle this task in a parameter-efficient and more effective manner. The network of AVPC features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding (PC)-based sound separation network that can extract audio features, fuse multimodal information, and predict sound separation masks in the same architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop a valid self-supervised learning strategy for AVPC via copredicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds, while reducing the model size significantly. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding. |
WOS keywords | INTEGRATION |
Funding projects | Major Project for New Generation of AI[2018AAA0100400] ; National Natural Science Foundation of China[61836014] ; National Natural Science Foundation of China[U21B2042] ; National Natural Science Foundation of China[62072457] ; National Natural Science Foundation of China[62006231] ; National Natural Science Foundation of China[61976174] ; China Postdoctoral Science Foundation[2021M703489] |
WOS research areas | Computer Science ; Engineering |
Language | English |
WOS accession number | WOS:001030674000001 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
Funding organizations | Major Project for New Generation of AI ; National Natural Science Foundation of China ; China Postdoctoral Science Foundation |
Source URL | http://ir.ia.ac.cn/handle/173211/53783 |
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems |
Corresponding author | Zhang, Zhaoxiang |
Author affiliations | 1.Chinese Acad Sci, Hong Kong Inst Sci & Innovat, Ctr Artificial Intelligence & Robot, Hong Kong, Peoples R China 2.Xi An Jiao Tong Univ, Sch Math & Stat, Xian 710049, Peoples R China 3.Chinese Acad Sci, Inst Automat, Ctr Res Intelligent Percept & Comp, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China |
Recommended citation (GB/T 7714) | Song, Zengjie,Zhang, Zhaoxiang. Visually Guided Sound Source Separation With Audio-Visual Predictive Coding[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS,2023:15. |
APA | Song, Zengjie,&Zhang, Zhaoxiang.(2023).Visually Guided Sound Source Separation With Audio-Visual Predictive Coding.IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS,15. |
MLA | Song, Zengjie,et al."Visually Guided Sound Source Separation With Audio-Visual Predictive Coding".IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2023):15. |
Ingest method: OAI harvesting
Source: Institute of Automation