Chinese Academy of Sciences Institutional Repositories Grid
Transformer-based Spiking Neural Networks for Multimodal Audio-Visual Classification

Document Type: Journal Article

Authors: Guo LY (郭凌月)2,3; Zeyu Gao2,3; Jinye Qu2,3; Suiwu Zheng2,3; Runhao Jiang1; Yanfeng Lu2,3; Hong Qiao2,3
Journal: IEEE Transactions on Cognitive and Developmental Systems
Publication Date: 2023
DOI: 10.1109/TCDS.2023.3327081
Abstract

The spiking neural networks (SNNs), as brain-inspired neural networks, have received noteworthy attention due to their advantages of low power consumption, high parallelism, and high fault tolerance. While SNNs have shown promising results in uni-modal data tasks, their deployment in multi-modal audiovisual classification remains limited, and the effectiveness of capturing correlations between visual and audio modalities in SNNs needs improvement. To address these challenges, we propose a novel model called Spiking Multi-Model Transformer (SMMT) that combines SNNs and Transformers for multi-modal audiovisual classification. The SMMT model integrates uni-modal sub-networks for visual and auditory modalities with a novel Spiking Cross-Attention module for fusion, enhancing the correlation between visual and audio modalities. This approach leads to competitive accuracy in multi-modal classification tasks with low energy consumption, making it an effective and energy-efficient solution. Extensive experiments on a public event-based dataset (N-TIDIGIT & MNIST-DVS) and two self-made audiovisual datasets of real-world objects (CIFAR10-AV and UrbanSound8K-AV) demonstrate the effectiveness and energy efficiency of the proposed SMMT model in multi-modal audio-visual classification tasks. Our constructed multi-modal audiovisual datasets can be accessed at https://github.com/Guo-Lingyue/SMMT.
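The abstract describes fusing the two modalities with a Spiking Cross-Attention module, where one modality supplies queries and the other supplies keys and values, and activations are binary spike trains. The sketch below is an illustrative approximation of that idea, not the authors' implementation: all function names, the threshold value, and the coincidence-count scoring are assumptions. Because spikes are binary, the query-key dot products reduce to coincidence counts (additions rather than multiplications), which is one intuition for the low energy cost claimed for SNNs.

```python
import math

def heaviside(x, threshold=0.5):
    # Fire a spike when the accumulated potential crosses the threshold
    # (a stand-in for a spiking neuron's firing rule; threshold is assumed).
    return 1.0 if x >= threshold else 0.0

def spiking_cross_attention(q_spikes, k_spikes, v_spikes, threshold=0.5):
    """Hypothetical sketch of cross-modal spiking attention.

    q_spikes: binary query vectors from one modality (e.g. audio)
    k_spikes, v_spikes: binary key/value vectors from the other modality
    Scores are spike-coincidence counts, so no multiplications are needed
    for the query-key step; the output is re-binarized by a threshold unit.
    """
    d = len(k_spikes[0])
    out = []
    for q in q_spikes:
        # Coincidence count between the query and each key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in k_spikes]
        # Score-weighted sum of values, one component per feature dimension
        mixed = [sum(s * v[j] for s, v in zip(scores, v_spikes))
                 for j in range(d)]
        # Re-binarize so the fused output is again a spike vector
        out.append([heaviside(m, threshold) for m in mixed])
    return out
```

For example, a query that coincides exactly with the first key routes that key's value through unchanged: `spiking_cross_attention([[1, 0, 1, 0]], [[1, 0, 1, 0], [0, 1, 0, 1]], [[1, 1, 0, 0], [0, 0, 1, 1]])` yields `[[1.0, 1.0, 0.0, 0.0]]`.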

Source URL: http://ir.ia.ac.cn/handle/173211/56541
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems
Corresponding Author: Yanfeng Lu
Author Affiliations:
1. College of Computer Science and Technology, Zhejiang University
2. State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences (CASIA)
3. University of Chinese Academy of Sciences (UCAS)
Recommended Citation
GB/T 7714: Guo LY, Zeyu Gao, Jinye Qu, et al. Transformer-based Spiking Neural Networks for Multimodal Audio-Visual Classification[J]. IEEE Transactions on Cognitive and Developmental Systems, 2023: DOI 10.1109/TCDS.2023.3327081.
APA: Guo LY., Zeyu Gao., Jinye Qu., Suiwu Zheng., Runhao Jiang., ... & Hong Qiao. (2023). Transformer-based Spiking Neural Networks for Multimodal Audio-Visual Classification. IEEE Transactions on Cognitive and Developmental Systems, DOI 10.1109/TCDS.2023.3327081.
MLA: Guo LY, et al. "Transformer-based Spiking Neural Networks for Multimodal Audio-Visual Classification". IEEE Transactions on Cognitive and Developmental Systems (2023): DOI 10.1109/TCDS.2023.3327081.

Deposit Method: OAI harvesting

Source: Institute of Automation


Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.