|Author||Huang, Yan; Long, Yang; Wang, Liang
|Keyword||Image And Sentence Matching
Although image and sentence matching has been widely studied,
its intrinsic few-shot problem is commonly ignored,
which has become a bottleneck for further performance improvement.
In this work, we focus on the challenging problem
of few-shot image and sentence matching, and propose a Gated
Visual-Semantic Embedding (GVSE) model to deal with
it. The model consists of three cooperative modules:
an uncommon VSE, a common VSE, and gated metric fusion.
The uncommon VSE exploits external auxiliary sources to
extract generic features for describing uncommon instances
and words in images and sentences, and then integrates them
by modeling their semantic relation to obtain global representations
for association analysis. To better model the most
common instances and words in the remaining content of images and
sentences, the common VSE learns their discriminative representations
directly from scratch. After obtaining two similarity
metrics from the two VSE modules, the gated metric
fusion module adaptively fuses them by automatically balancing
their relative importance. Based on the fused metric,
we perform extensive experiments in terms of few-shot
and conventional image and sentence matching, and demonstrate
the effectiveness of the proposed model by achieving
the state-of-the-art results on two public benchmark datasets.
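The abstract does not give the exact form of the gated metric fusion module, but the idea of adaptively balancing two similarity metrics with a learned gate can be sketched as follows. This is a minimal illustration assuming a scalar sigmoid gate over the two scores; the gate parameters `w` and `b` stand in for quantities that would be learned end-to-end in the actual model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_metric_fusion(s_uncommon, s_common, w, b):
    """Fuse two similarity scores with a learned sigmoid gate.

    s_uncommon, s_common: similarity metrics from the two VSE modules.
    w, b: hypothetical gate parameters (learned in the real model).
    """
    # Gate value in (0, 1), computed from both input similarities.
    g = sigmoid(w[0] * s_uncommon + w[1] * s_common + b)
    # Convex combination: the gate balances the relative importance
    # of the uncommon-VSE and common-VSE metrics.
    return g * s_uncommon + (1.0 - g) * s_common
```

With zero-initialized gate parameters the gate outputs 0.5, so fusion reduces to a plain average of the two metrics; training would shift the gate toward whichever metric is more reliable for a given image-sentence pair.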
|Author of Source||Zhihua Zhou
Huang, Yan; Long, Yang; Wang, Liang. Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding [C]. In: . Honolulu. 2019.1.27-2019.2.1.