Dense Modality Interaction Network for Audio-Visual Event Localization
Document type: Journal article
Authors | Liu, Shuo 2,3; Quan, Weize 2,3; Wang, Chaoqun; Liu, Yuan; Liu, Bin; Yan, Dong-Ming |
Journal | IEEE Transactions on Multimedia |
Publication date | 2022-02 |
Pages | 1-1 |
Abstract | Human perception systems can integrate audio and visual information automatically to obtain a profound understanding of real-world events. Accordingly, fusing audio and visual contents is important for solving the audio-visual event (AVE) localization problem. Although most existing works have fused the audio and visual modalities with attention-based networks to explore their relationship, this relationship can be modeled more deeply to improve the fusion capability of the two modalities. In this paper, we propose a dense modality interaction network (DMIN) to elegantly leverage audio and visual information by integrating two novel modules, namely, the audio-guided triplet attention (AGTA) module and the dense inter-modality attention (DIMA) module. The AGTA module enables audio information to guide the network to pay more attention to event-relevant visual regions. This guidance is conducted along the channel, temporal, and spatial dimensions, emphasizing informative features, temporal relationships, and spatial regions to boost the representational capacity. Furthermore, the DIMA module establishes a dense relationship between the audio and visual modalities. Specifically, the DIMA module leverages the information of all channel pairs of the audio and visual features to formulate the cross-modality attention weight, which is superior to the multi-head attention module that uses only limited information. Moreover, a novel unimodal discrimination loss (UDL) is introduced to exploit the unimodal and fused features together for more accurate AVE localization. The experimental results show that our method is remarkably superior to the state-of-the-art methods in both fully- and weakly-supervised AVE settings. To further evaluate the model's ability to build audio-visual connections, we design a dense cross-modality relation network (DCMR) to solve the cross-modality localization task. DCMR is a simple variant of DMIN, and the experimental results further illustrate that DIMA can explore denser relationships between the two modalities. Code is available at https://github.com/weizequan/DMIN.git. |
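A rough reading of the DIMA idea from the abstract: instead of compressing the features into a small number of attention heads, the cross-modality attention weight is built from every audio-visual channel pair. The sketch below is a minimal, hypothetical PyTorch illustration of that idea; the module name, tensor shapes, and projection layer are assumptions made here for clarity and are not the authors' implementation (see the linked repository for the real code).

```python
import torch
import torch.nn as nn


class DensePairAttention(nn.Module):
    """Hypothetical sketch of dense inter-modality attention:
    cross-modal weights are derived from all audio-visual channel
    pairs rather than from a fixed number of attention heads."""

    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        # One projection from all (audio channel, visual channel) pairs
        # to a per-visual-channel attention weight.
        self.pair_score = nn.Linear(audio_dim * visual_dim, visual_dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio:  (batch, time, audio_dim)
        # visual: (batch, time, visual_dim)
        # Outer product over channels enumerates every audio-visual channel pair.
        pairs = torch.einsum("bta,btv->btav", audio, visual)  # (b, t, A, V)
        pairs = pairs.flatten(start_dim=2)                     # (b, t, A*V)
        # Turn the pairwise evidence into cross-modal attention weights.
        weights = torch.sigmoid(self.pair_score(pairs))        # (b, t, V)
        # Re-weight the visual features with the cross-modal attention.
        return visual * weights


if __name__ == "__main__":
    # Toy shapes: 10 one-second segments, 128-d audio, 512-d visual features.
    audio = torch.randn(2, 10, 128)
    visual = torch.randn(2, 10, 512)
    attended = DensePairAttention(128, 512)(audio, visual)
    print(attended.shape)  # torch.Size([2, 10, 512])
```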
Language | English |
Source URL | http://ir.ia.ac.cn/handle/173211/51501 |
Collection | State Key Laboratory of Multimodal Artificial Intelligence Systems |
Corresponding author | Yan, Dong-Ming |
Author affiliations | 1. Speech Lab, Alibaba Group, Beijing; 2. School of Artificial Intelligence, UCAS; 3. NLPR, Institute of Automation, Chinese Academy of Sciences |
Recommended citation (GB/T 7714) | Liu, Shuo, Quan, Weize, Wang, Chaoqun, et al. Dense Modality Interaction Network for Audio-Visual Event Localization[J]. IEEE Transactions on Multimedia, 2022: 1-1. |
APA | Liu, Shuo, Quan, Weize, Wang, Chaoqun, Liu, Yuan, Liu, Bin, & Yan, Dong-Ming. (2022). Dense Modality Interaction Network for Audio-Visual Event Localization. IEEE Transactions on Multimedia, 1-1. |
MLA | Liu, Shuo, et al. "Dense Modality Interaction Network for Audio-Visual Event Localization." IEEE Transactions on Multimedia (2022): 1-1. |
Ingestion method: OAI harvesting
Source: Institute of Automation