Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning
Document type: Journal article
Authors | Zhang, Kaiwen1; Zhao, Kunchen1; Tian, Yunong2 |
Journal | MATHEMATICS |
Publication date | 2024-07-01 |
Volume | 12; Issue: 14; Pages: 16 |
Keywords | audio-visual zero-shot learning transformer |
DOI | 10.3390/math12142200 |
Corresponding author | Tian, Yunong (yunong.tian@ia.ac.cn) |
Abstract | Zero-shot learning (ZSL) enables models to recognize categories not encountered during training, which is crucial for categories with limited data. Existing methods overlook efficient temporal modeling in multimodal data. This paper proposes a Temporal-Semantic Aligning and Reasoning Transformer (TSART) for spatio-temporal modeling. TSART uses the pre-trained SeLaVi network to extract audio and visual features and explores the semantic information of these modalities through audio and visual encoders. It incorporates a temporal information reasoning module to enhance the capture of temporal features in audio, and a cross-modal reasoning module to effectively integrate audio and visual information, establishing a robust joint embedding representation. Our experimental results validate the effectiveness of this approach, demonstrating outstanding Generalized Zero-Shot Learning (GZSL) performance on the UCF101 Generalized Zero-Shot Learning (UCF-GZSL), VGGSound-GZSL, and ActivityNet-GZSL datasets, with notable improvements in the Harmonic Mean (HM) evaluation. These results indicate that TSART has great potential in handling complex spatio-temporal information and multimodal fusion. |
Funding project | National Natural Science Foundation of China[62206275] |
WOS research area | Mathematics |
Language | English |
WOS accession number | WOS:001277575400001 |
Publisher | MDPI |
Funding organization | National Natural Science Foundation of China |
Source URL | [http://ir.ia.ac.cn/handle/173211/59367] |
Collection | Research Center of Precision Sensing and Control_Precision Sensing and Control |
Corresponding author | Tian, Yunong |
Author affiliations | 1. Beijing Forestry Univ, Sch Informat Sci & Technol, Beijing 100083, Peoples R China; 2. Chinese Acad Sci, Inst Automat, CAS Engn Lab Intelligent Ind Vis, Beijing 100190, Peoples R China |
Recommended citation GB/T 7714 | Zhang, Kaiwen,Zhao, Kunchen,Tian, Yunong. Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning[J]. MATHEMATICS,2024,12(14):16. |
APA | Zhang, Kaiwen,Zhao, Kunchen,&Tian, Yunong.(2024).Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning.MATHEMATICS,12(14),16. |
MLA | Zhang, Kaiwen,et al."Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning".MATHEMATICS 12.14(2024):16. |
Ingestion method: OAI harvesting
Source: Institute of Automation