Chinese Academy of Sciences Institutional Repositories Grid
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Document type: Conference paper

Authors: Yunan Zeng; Yan Huang; Jinjin Zhang; Zequn Jie; Zhenhua Chai; Liang Wang
Publication date: 2024-06-18
Conference dates: 17-21 June 2024
Conference location: Seattle, WA, USA
Abstract

Pre-trained vision-language models (VLMs) have achieved high performance on various downstream tasks and have been widely used for visual grounding in a weakly supervised manner. However, despite the performance gains contributed by large-scale vision and language pre-training, we find that state-of-the-art VLMs struggle with compositional reasoning on grounding tasks. To demonstrate this, we propose the Attribute, Relation, and Priority grounding (ARPGrounding) benchmark to test the compositional reasoning ability of VLMs on visual grounding tasks. ARPGrounding contains 11,425 samples and evaluates the compositional understanding of VLMs along three dimensions: 1) attribute, denoting comprehension of objects' properties; 2) relation, indicating an understanding of relations between objects; 3) priority, reflecting an awareness of the part of speech associated with nouns. Using the ARPGrounding benchmark, we evaluate several mainstream VLMs. We empirically find that these models perform quite well on conventional visual grounding datasets, achieving performance comparable to or surpassing state-of-the-art methods, yet they show strong deficiencies in compositional reasoning. Furthermore, we propose a composition-aware fine-tuning pipeline, demonstrating the potential of leveraging cost-effective image-text annotations to enhance the compositional understanding of VLMs on grounding tasks.
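For readers unfamiliar with how such a benchmark is scored: visual grounding evaluations conventionally count a prediction as correct when the predicted box overlaps the ground-truth box with IoU of at least 0.5. Below is a minimal Python sketch of a per-dimension evaluation loop under that conventional criterion; the Sample format, the dimension names, and the ground(image_path, expression) model interface are assumptions for illustration, not the authors' released code or the official ARPGrounding protocol.

from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Sample:
    image_path: str   # path to the image
    expression: str   # referring expression to ground
    gt_box: Box       # ground-truth bounding box
    dimension: str    # hypothetical: "attribute" | "relation" | "priority"

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def evaluate(ground: Callable[[str, str], Box],
             samples: Iterable[Sample],
             iou_thresh: float = 0.5) -> dict:
    """Per-dimension grounding accuracy: a prediction counts as correct
    when its IoU with the ground-truth box reaches iou_thresh (the
    conventional 0.5 criterion for visual grounding)."""
    hits, totals = {}, {}
    for s in samples:
        pred = ground(s.image_path, s.expression)
        totals[s.dimension] = totals.get(s.dimension, 0) + 1
        if iou(pred, s.gt_box) >= iou_thresh:
            hits[s.dimension] = hits.get(s.dimension, 0) + 1
    return {d: hits.get(d, 0) / n for d, n in totals.items()}

Reporting accuracy separately per dimension is what lets a benchmark like this expose compositional failures that an aggregate grounding score would hide.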

Source URL: http://ir.ia.ac.cn/handle/173211/57210
Collection: Institute of Automation, Center for Research on Intelligent Perception and Computing
Corresponding author: Liang Wang
Author affiliations:
1. Meituan
2. Institute of Automation, Chinese Academy of Sciences
3. Center for Research on Intelligent Perception and Computing
4. School of Artificial Intelligence, University of Chinese Academy of Sciences
Recommended citation (GB/T 7714):
Yunan Zeng, Yan Huang, Jinjin Zhang, et al. Investigating Compositional Challenges in Vision-Language Models for Visual Grounding[C]. Seattle, WA, USA, 17-21 June 2024.

Ingestion method: OAI harvesting

Source: Institute of Automation
