Chinese Academy of Sciences Institutional Repositories Grid (CAS IR Grid): Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning

Document type: Journal article


Authors	Zhang, Kaiwen 1; Zhao, Kunchen 1; Tian, Yunong 2
Journal	MATHEMATICS
Publication date	2024-07-01
Volume	12 Issue: 14 Pages: 16
Keywords	audio-visual; zero-shot learning; transformer
DOI	10.3390/math12142200
Corresponding author	Tian, Yunong (yunong.tian@ia.ac.cn)
Abstract	Zero-shot learning (ZSL) enables models to recognize categories not encountered during training, which is crucial for categories with limited data. Existing methods overlook efficient temporal modeling in multimodal data. This paper proposes a Temporal-Semantic Aligning and Reasoning Transformer (TSART) for spatio-temporal modeling. TSART uses the pre-trained SeLaVi network to extract audio and visual features and explores the semantic information of these modalities through audio and visual encoders. It incorporates a temporal information reasoning module to enhance the capture of temporal features in audio, and a cross-modal reasoning module to effectively integrate audio and visual information, establishing a robust joint embedding representation. Our experimental results validate the effectiveness of this approach, demonstrating outstanding Generalized Zero-Shot Learning (GZSL) performance on the UCF101 Generalized Zero-Shot Learning (UCF-GZSL), VGGSound-GZSL, and ActivityNet-GZSL datasets, with notable improvements in the Harmonic Mean (HM) evaluation. These results indicate that TSART has great potential in handling complex spatio-temporal information and multimodal fusion.
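Funding project	National Natural Science Foundation of China[62206275]
WOS research area	Mathematics
Language	English
WOS accession number	WOS:001277575400001
Publisher	MDPI
Funding organization	National Natural Science Foundation of China
Source URL	[http://ir.ia.ac.cn/handle/173211/59367]
Collection	Research Center of Precision Sensing and Control_Precision Sensing and Control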
Author affiliations	1. Beijing Forestry Univ, Sch Informat Sci & Technol, Beijing 100083, Peoples R China; 2. Chinese Acad Sci, Inst Automat, CAS Engn Lab Intelligent Ind Vis, Beijing 100190, Peoples R China
Recommended citation (GB/T 7714)	Zhang, Kaiwen, Zhao, Kunchen, Tian, Yunong. Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning[J]. MATHEMATICS, 2024, 12(14): 16.
APA	Zhang, Kaiwen, Zhao, Kunchen, & Tian, Yunong. (2024). Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning. MATHEMATICS, 12(14), 16.
MLA	Zhang, Kaiwen, et al. "Temporal-Semantic Aligning and Reasoning Transformer for Audio-Visual Zero-Shot Learning." MATHEMATICS 12.14 (2024): 16.
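
Note on the Harmonic Mean (HM) named in the abstract: in generalized zero-shot learning, HM is conventionally computed from the accuracy on seen classes (S) and unseen classes (U) as HM = 2*S*U / (S + U). The following is a minimal sketch of that standard formula only; the function name and the example accuracies are hypothetical and are not taken from the paper.

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Standard GZSL harmonic mean: HM = 2*S*U / (S + U).

    Returns 0.0 when both accuracies are zero to avoid dividing by zero.
    """
    total = seen_acc + unseen_acc
    if total == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies (percent), for illustration only.
print(harmonic_mean(60.0, 30.0))  # 40.0
```

Because HM is pulled toward the lower of the two values, a model cannot score well by favoring seen classes at the expense of unseen ones.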

Ingest method: OAI harvesting

Source: Institute of Automation
