Deep Hierarchical Encoder-Decoder Network for Image Captioning
Document type: Journal article
Author | Xiao, Xinyu1,2 |
Journal | IEEE TRANSACTIONS ON MULTIMEDIA |
Publication date | 2019-11-01 |
Volume | 21 ; Issue: 11 ; Pages: 2942-2956 |
Keywords | Visualization ; Semantics ; Hidden Markov models ; Decoding ; Logic gates ; Training ; Computer architecture ; Deep hierarchical structure ; encoder-decoder ; LSTM ; image captioning ; retrieval ; vision-sentence |
ISSN | 1520-9210 |
DOI | 10.1109/TMM.2019.2915033 |
Corresponding author | Wang, Lingfeng(lfwang@nlpr.ia.ac.cn) |
Abstract | Encoder-decoder models have been widely used in image captioning, and most of them are built on a single long short-term memory (LSTM) network. The capacity of such a single-layer network, in which the encoder and decoder are integrated, is limited for a task as complex as image captioning. Moreover, how to effectively increase the "vertical depth" of the encoder-decoder remains an open problem. To address these problems, a novel deep hierarchical encoder-decoder network is proposed for image captioning, in which a deep hierarchical structure is explored to separate the functions of the encoder and the decoder. This model can efficiently exploit the representation capacity of deep networks to fuse high-level semantics of vision and language when generating captions. Specifically, visual representations at the top levels of abstraction are considered simultaneously, and each of these levels is associated with one LSTM. The bottom-most LSTM serves as the encoder of textual inputs. The middle layer of the encoder-decoder enhances the decoding ability of the top-most LSTM. Furthermore, by introducing a semantic enhancement module for image features and a distribution combine module for text features, architectural variants of our model are constructed to explore the impacts of, and mutual interactions among, the visual representations, the textual representations, and the output of the middle LSTM layer. In particular, the framework is trained with a reinforcement learning method, using policy gradient optimization, to address the exposure bias problem between training and testing. Qualitative analyses illustrate the process by which our model "translates" an image into a sentence, and further visualization presents the evolution of the hidden states of the different hierarchical LSTMs over time. Extensive experiments demonstrate that our model outperforms current state-of-the-art models on three benchmark datasets: Flickr8K, Flickr30K, and MSCOCO.
On both image captioning and retrieval tasks, our method achieves the best results. On the MSCOCO captioning leaderboard, our method also achieves superior performance. |
Funding projects | National Natural Science Foundation of China[91646207] ; National Natural Science Foundation of China[61773377] ; National Natural Science Foundation of China[61573352] ; Beijing Natural Science Foundation[L172053] |
WOS research areas | Computer Science ; Telecommunications |
Language | English |
WOS accession number | WOS:000494363000020 |
Publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC |
Funding organizations | National Natural Science Foundation of China ; Beijing Natural Science Foundation |
Source URL | [http://ir.ia.ac.cn/handle/173211/28920] |
Collection | Institute of Automation_National Laboratory of Pattern Recognition_Remote Sensing Image Processing Team |
Corresponding author | Wang, Lingfeng |
Author affiliations | 1. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China ; 2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China |
Recommended citation (GB/T 7714) | Xiao, Xinyu,Wang, Lingfeng,Ding, Kun,et al. Deep Hierarchical Encoder-Decoder Network for Image Captioning[J]. IEEE TRANSACTIONS ON MULTIMEDIA,2019,21(11):2942-2956. |
APA | Xiao, Xinyu,Wang, Lingfeng,Ding, Kun,Xiang, Shiming,&Pan, Chunhong.(2019).Deep Hierarchical Encoder-Decoder Network for Image Captioning.IEEE TRANSACTIONS ON MULTIMEDIA,21(11),2942-2956. |
MLA | Xiao, Xinyu,et al."Deep Hierarchical Encoder-Decoder Network for Image Captioning".IEEE TRANSACTIONS ON MULTIMEDIA 21.11(2019):2942-2956. |
Ingestion method: OAI harvesting
Source: Institute of Automation
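
The abstract states that the framework is trained with policy gradient optimization to address exposure bias, but this record gives no further detail. As a rough illustration only, the following is a minimal sketch of a generic REINFORCE-style loss for one sampled caption. The function names, the scalar `reward` (for example, a caption-quality score), and the `baseline` term are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss_and_grad(step_logits, sampled_tokens, reward, baseline=0.0):
    """Hypothetical helper: REINFORCE loss for one sampled caption.

    step_logits:    one list of vocabulary logits per time step
    sampled_tokens: index of the token sampled at each time step
    reward:         scalar score of the whole sampled caption (assumed)
    baseline:       scalar subtracted from the reward to reduce variance (assumed)

    Returns the loss -(reward - baseline) * sum_t log p(w_t) and its
    gradient with respect to the logits at every time step.
    """
    advantage = reward - baseline
    loss = 0.0
    grads = []
    for logits, token in zip(step_logits, sampled_tokens):
        probs = softmax(logits)
        loss -= advantage * math.log(probs[token])
        # d(-A * log p[token]) / d logits = A * (probs - onehot(token))
        grads.append([
            advantage * (p - (1.0 if i == token else 0.0))
            for i, p in enumerate(probs)
        ])
    return loss, grads


# Usage: a one-step "caption" over a two-word vocabulary.
loss, grads = reinforce_loss_and_grad([[0.0, 0.0]], [0], reward=1.0)
```

Because the model is scored on captions it samples itself rather than on ground-truth prefixes, a loss of this kind is one common way to reduce the mismatch between training and testing that the abstract calls exposure bias.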