Chinese Academy of Sciences Institutional Repositories Grid
Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval

Document type: Journal article

Author: Haoyu Lu [2]
Journal: Machine Intelligence Research
Publication date: 2023
Volume: 20, Issue: 4, Pages: 569-582
ISSN: 2731-538X
Keywords: Image-text retrieval, multimodal modeling, contrastive learning, weak correlation, computer vision
DOI: 10.1007/s11633-022-1386-4
Abstract: Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, and are thus difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many recent works adopt the single-tower architecture with heavy detectors, which is inefficient during the inference stage because the costly computation must be repeated for each text-image pair. In this work, to overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which enables a unified feature space in which the text and image modalities can be directly compared with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that our CMCL outperforms state-of-the-art methods while being much more efficient.
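The cross-modal contrastive loss on global image/text features mentioned in the abstract can be illustrated with a minimal sketch. This is an assumption-laden example of a standard symmetric InfoNCE-style objective for a two-tower setup, not the authors' exact CMCL loss; the function name, temperature value, and use of NumPy are all illustrative choices.

```python
import numpy as np

def cross_modal_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over global features.

    Illustrative sketch only (not the paper's exact formulation).
    image_feats, text_feats: (N, D) arrays where row i of each array
    comes from the i-th matched image-text pair.
    """
    # L2-normalize so the dot product is cosine similarity,
    # i.e. both towers map into a unified, directly comparable space.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs lie on the diagonal.
    logits = img @ txt.T / temperature

    def ce_diag(l):
        # Softmax cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))
```

Because each tower encodes its modality independently, image features can be computed once and compared against any number of text queries at inference time, which is the efficiency advantage of the two-tower design over single-tower architectures.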
Source URL: http://ir.ia.ac.cn/handle/173211/52350
Collection: Institute of Automation_Academic Journals_International Journal of Automation and Computing
Author affiliations:
1. The University of Hong Kong, Hong Kong 999077, China
2. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China
Recommended citation:
GB/T 7714
Haoyu Lu. Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval[J]. Machine Intelligence Research,2023,20(4):569-582.
APA: Haoyu Lu. (2023). Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval. Machine Intelligence Research, 20(4), 569-582.
MLA: Haoyu Lu. "Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval." Machine Intelligence Research 20.4 (2023): 569-582.

Deposit method: OAI harvesting

Source: Institute of Automation


Unless otherwise noted, all content in this system is protected by copyright, with all rights reserved.