Chinese Academy of Sciences Institutional Repositories Grid
Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning

Document type: Journal article

Authors: Zhang, Kaiwen (1); Zhao, Kunchen (1); Tian, Yunong (2)
Journal: MATHEMATICS
Publication date: 2024-07-01
Volume: 12; Issue: 14; Pages: 16
Keywords: audio-visual zero-shot learning transformer
DOI: 10.3390/math12142200
Corresponding author: Tian, Yunong (yunong.tian@ia.ac.cn)
Abstract: Zero-shot learning (ZSL) enables models to recognize categories not encountered during training, which is crucial for categories with limited data. Existing methods overlook efficient temporal modeling in multimodal data. This paper proposes a Temporal-Semantic Aligning and Reasoning Transformer (TSART) for spatio-temporal modeling. TSART uses the pre-trained SeLaVi network to extract audio and visual features and explores the semantic information of these modalities through audio and visual encoders. It incorporates a temporal information reasoning module to enhance the capture of temporal features in audio, and a cross-modal reasoning module to effectively integrate audio and visual information, establishing a robust joint embedding representation. Our experimental results validate the effectiveness of this approach, demonstrating outstanding Generalized Zero-Shot Learning (GZSL) performance on the UCF101 Generalized Zero-Shot Learning (UCF-GZSL), VGGSound-GZSL, and ActivityNet-GZSL datasets, with notable improvements in the Harmonic Mean (HM) evaluation. These results indicate that TSART has great potential in handling complex spatio-temporal information and multimodal fusion.
Funding project: National Natural Science Foundation of China [62206275]
WOS research area: Mathematics
Language: English
WOS accession number: WOS:001277575400001
Publisher: MDPI
Funding agency: National Natural Science Foundation of China
Source URL: [http://ir.ia.ac.cn/handle/173211/59367]
Collection: Research Center of Precision Sensing and Control_Precision Sensing and Control
Author affiliations:
1. Beijing Forestry Univ, Sch Informat Sci & Technol, Beijing 100083, Peoples R China
2. Chinese Acad Sci, Inst Automat, CAS Engn Lab Intelligent Ind Vis, Beijing 100190, Peoples R China
Recommended citation
GB/T 7714
Zhang, Kaiwen, Zhao, Kunchen, Tian, Yunong. Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning[J]. MATHEMATICS, 2024, 12(14): 16.
APA Zhang, Kaiwen, Zhao, Kunchen, & Tian, Yunong. (2024). Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning. MATHEMATICS, 12(14), 16.
MLA Zhang, Kaiwen, et al. "Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning". MATHEMATICS 12.14 (2024): 16.

Ingest method: OAI harvesting

Source: Institute of Automation
