Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
Document type: Journal article
Author | Haoyu Lu² |
Journal | Machine Intelligence Research |
Publication date | 2023 |
Volume, Issue, Pages | 20(4): 569-582 |
ISSN | 2731-538X |
Keywords | Image-text retrieval, multimodal modeling, contrastive learning, weak correlation, computer vision |
DOI | 10.1007/s11633-022-1386-4 |
Abstract | Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, which makes them difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many recent works adopt a single-tower architecture with heavy detectors, which is inefficient during the inference stage because the costly computation must be repeated for each text-image pair. In this work, to overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which enables a unified feature space in which the text and image modalities can be directly compared with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that our CMCL outperforms state-of-the-art methods while being much more efficient. |
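The abstract describes a two-tower design in which image and text encoders map into a shared feature space and a cross-modal contrastive loss pulls matched global features together. As a rough illustration only (not the paper's implementation; the symmetric InfoNCE form, the function name, and the temperature value are assumptions), such a loss over a batch of pre-computed embeddings could be sketched as:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of img_emb and row i of txt_emb are assumed to be a matched pair;
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs lie on the diagonal
    logits = (img @ txt.T) / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_softmax).mean()   # target class = diagonal

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the two towers produce embeddings independently, all image and text features can be pre-computed once and retrieval reduces to a similarity lookup, which is the efficiency argument the abstract makes against single-tower models.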
Source URL | [http://ir.ia.ac.cn/handle/173211/52350] |
Collection | Institute of Automation_Academic Journals_International Journal of Automation and Computing |
Affiliations | 1. The University of Hong Kong, Hong Kong 999077, China 2. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China |
Recommended citation (GB/T 7714) | Haoyu Lu. Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval[J]. Machine Intelligence Research, 2023, 20(4): 569-582. |
APA | Haoyu Lu. (2023). Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval. Machine Intelligence Research, 20(4), 569-582. |
MLA | Haoyu Lu. "Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval." Machine Intelligence Research 20.4 (2023): 569-582. |
Deposit method: OAI harvesting
Source: Institute of Automation