Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval
Document type: Journal article
Author | Haoyu Lu² |
Journal | Machine Intelligence Research |
Publication date | 2023 |
Volume, Issue, Pages | 20(4): 569-582 |
ISSN | 2731-538X |
Keywords | Image-text retrieval, multimodal modeling, contrastive learning, weak correlation, computer vision |
DOI | 10.1007/s11633-022-1386-4 |
Abstract | Cross-modal image-text retrieval is a fundamental task in bridging vision and language. It faces two main challenges that are typically not well addressed in previous works. 1) Generalizability: Existing methods often assume a strong semantic correlation between each text-image pair, which makes them difficult to generalize to real-world scenarios where weak correlation dominates. 2) Efficiency: Many recent works adopt a single-tower architecture with heavy detectors, which is inefficient during the inference stage because the costly computation must be repeated for each text-image pair. In this work, to overcome these two challenges, we propose a two-tower cross-modal contrastive learning (CMCL) framework. Specifically, we first devise a two-tower architecture, which enables a unified feature space in which the text and image modalities can be directly compared with each other, alleviating the heavy computation during inference. We further introduce a simple yet effective module named multi-grid split (MGS) to learn fine-grained image features without using detectors. Last but not least, we deploy a cross-modal contrastive loss on the global image/text features to learn their weak correlation and thus achieve high generalizability. To validate that our CMCL can be readily generalized to real-world scenarios, we construct a large multi-source image-text dataset called the weak semantic correlation dataset (WSCD). Extensive experiments show that our CMCL outperforms state-of-the-art methods while being much more efficient. |
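The abstract describes a two-tower design in which image and text encoders map into a shared feature space and a cross-modal contrastive loss pulls matched global features together. As a rough illustration only (not the paper's implementation; the symmetric InfoNCE form, the function name, and the temperature value are assumptions), such a loss over a batch of pre-computed embeddings could be sketched as:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of img_emb and row i of txt_emb are assumed to be a matched pair;
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs lie on the diagonal
    logits = (img @ txt.T) / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_softmax).mean()   # target class = diagonal

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because the two towers produce embeddings independently, all image and text features can be pre-computed once and retrieval reduces to a similarity lookup, which is the efficiency argument the abstract makes against single-tower models.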
Source URL | [http://ir.ia.ac.cn/handle/173211/52350] |
Collection | Institute of Automation_Academic Journals_International Journal of Automation and Computing |
Affiliations | 1. The University of Hong Kong, Hong Kong 999077, China 2. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China |
Recommended citation (GB/T 7714) | Haoyu Lu. Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval[J]. Machine Intelligence Research, 2023, 20(4): 569-582. |
APA | Haoyu Lu. (2023). Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval. Machine Intelligence Research, 20(4), 569-582. |
MLA | Haoyu Lu. "Cross-modal Contrastive Learning for Generalizable and Efficient Image-text Retrieval." Machine Intelligence Research 20.4 (2023): 569-582. |
Deposit method: OAI harvesting
Source: Institute of Automation