Chinese Academy of Sciences Institutional Repositories Grid
DTLLM-VLT: Diverse Text Generation for Visual Language Tracking

Document Type: Conference Paper

Authors: Xuchen Li (4); Xiaokun Feng (3,4); Shiyu Hu (3,4); Meiqi Wu (2); Dailing Zhang (3,4); Jing Zhang (4); Kaiqi Huang (1,3,4)
Publication Year: 2024
Conference Dates: June 17-18, 2024
Conference Location: Seattle
Abstract
Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions from a video for the precise tracking of a specified object. By leveraging high-level semantic information, VLT guides object tracking, alleviating the constraints associated with relying on the visual modality alone. Nevertheless, most VLT benchmarks are annotated at a single granularity and lack a coherent semantic framework to provide scientific guidance. Moreover, coordinating human annotators for high-quality annotations is laborious and time-consuming. To address these challenges, we introduce DTLLM-VLT, which automatically generates extensive and multi-granularity text to enhance environmental diversity. (1) DTLLM-VLT generates scientific and multi-granularity text descriptions using a cohesive prompt framework. Its succinct and highly adaptable design allows seamless integration into various visual tracking benchmarks. (2) We select three prominent benchmarks to deploy our approach: short-term tracking, long-term tracking, and global instance tracking. We offer four granularity combinations for these benchmarks, considering the extent and density of semantic information, thereby showcasing the practicality and versatility of DTLLM-VLT. (3) We conduct comparative experiments on VLT benchmarks with different text granularities, evaluating and analyzing the impact of diverse text on tracking performance. In conclusion, this work leverages LLMs to provide multi-granularity semantic information for the VLT task from efficient and diverse perspectives, enabling fine-grained evaluation of multi-modal trackers. In the future, we believe this work can be extended to more datasets to support vision dataset understanding.
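The abstract describes a cohesive prompt framework that produces text descriptions at several granularities, varying in the extent and density of semantic information. The sketch below is a minimal illustration of that idea only, not the authors' implementation: the granularity names, prompt templates, and the `generate` callable are assumptions made for illustration.

```python
# Minimal illustrative sketch of a multi-granularity prompt framework.
# NOTE: this is NOT the DTLLM-VLT implementation; granularity names,
# prompt templates, and the `generate` backend are assumptions.
from typing import Callable, Dict

# Hypothetical granularity levels: concise vs. detailed descriptions,
# produced either for the initial frame only or densely over the video.
PROMPTS: Dict[str, str] = {
    "initial_concise": "In one short phrase, describe the target object in the first frame.",
    "initial_detailed": "Describe the target object in the first frame, including its class, color, and position.",
    "dense_concise": "In one short phrase, describe the target object in the current frame.",
    "dense_detailed": "Describe the target object in the current frame, including its class, appearance, position, and motion.",
}

def describe(frame_context: str, granularity: str, generate: Callable[[str], str]) -> str:
    """Build a prompt for the requested granularity and query a text generator.

    `generate` stands in for any LLM backend (local model or hosted API);
    `frame_context` is assumed to be a textual summary of the frame plus
    the target's bounding box.
    """
    prompt = f"{PROMPTS[granularity]}\nFrame context: {frame_context}"
    return generate(prompt)

if __name__ == "__main__":
    # Dummy backend that just echoes the prompt, for demonstration.
    echo = lambda p: f"[LLM output for prompt: {p[:60]}...]"
    print(describe("red car at bbox (120, 80, 60, 40)", "initial_concise", echo))
```

In such a setup, swapping the prompt template is all that distinguishes one granularity from another, which is consistent with the abstract's claim that the design integrates easily into different tracking benchmarks.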
Source URL: http://ir.ia.ac.cn/handle/173211/59481
Collection: Intelligent Systems and Engineering
Author Affiliations:
1.CAS Center for Excellence in Brain Science and Intelligence Technology
2.School of Computer Science and Technology, University of Chinese Academy of Sciences
3.School of Artificial Intelligence, University of Chinese Academy of Sciences
4.CRISE, Institute of Automation, Chinese Academy of Sciences
Recommended Citation
GB/T 7714
Xuchen Li, Xiaokun Feng, Shiyu Hu, et al. DTLLM-VLT: Diverse Text Generation for Visual Language Tracking[C]. Seattle, June 17-18, 2024.

Deposit Method: OAI harvesting

Source: Institute of Automation

